Preface


Over the past decade, technology has gradually but steadily shifted its focus from
machine-oriented concepts, algorithms, and automata toward the maximization of human
potential and the fulfilment of human needs and human well-being. Consequently, the
development of computational power and computational efficiency has been set in the
context of effortless technology utilization by both individuals and societies. With the
recent advances in human-machine interfaces, wireless and mobile network technologies,
and data analytics, we have finally set out on the long journey toward making computer
services truly human-centric, instead of limiting human capacities to suit computer
requirements. Instead of playing a prominent role in the behavior of individuals and
societies, machines and their very existence have become ever more subtle: smart devices
have been woven into the everyday fabric; analytic power has penetrated all aspects of our
daily activities from family to work (and everything in between); and automation has
begun to displace segments of our routines. For an area as active as human-centered
computing, it is not possible to cover the entire thematic spectrum. Instead, the
Human-Centered Computing (HCC) conference aims to present a selection of examples of new
approaches, methods, and achievements that can underpin the aforementioned paradigm shift.

HCC 2017 was the third in the series, following successful events in Phnom Penh,
Cambodia (2014), and Colombo, Sri Lanka (2016). The HCC 2017 papers present a
balance between conceptual and empirical work, between design and evaluation 
studies, and between theoretical and applied research. 

All HCC 2017 submissions went through a rigorous paper evaluation and selection
process. Each paper was peer-reviewed by the Program Committee and selected
reviewers and meta-reviewed by senior Program Committee members. Based on these
recommendations, the program co-chairs made acceptance decisions and classified the
accepted papers into the following categories: full regular papers, short papers, and
position papers.

It has been another year of hard work and selfless contribution. As the conference
Organizing Committee, we are grateful to all members of the Technical Program
Committee: it was their hard work that enabled us to identify a set of high-quality
submissions reflecting the trends and interests of the paradigm shift. We would like to
extend our gratitude to the international Advisory Committee for its invaluable advice
and guidance. Finally, our special thanks also go to the additional reviewers, student
volunteers, and members of the local organization team, who were key to making
HCC 2017 a successful event.
Bo Hu

Vlada Kugurakova 
IoT+AI: Opportunities and Challenges


Huadong Ma 


School of Computer Science, Beijing University of Posts 
and Telecommunications, Beijing, China 
mhd@bupt.edu.cn 


The Internet of Things (IoT) enables the interconnection and integration of the
physical world and cyberspace, and has been widely considered the kernel technology
for sensing urban environments and providing smart services. At the same time, the
rapid development of Artificial Intelligence (AI) brings many opportunities to IoT. In
this talk, we first introduce the challenges of urban sensing networks. Combining AI
theory, we discuss some research on sensing, networking and computing, and services in
the IoT environment. Finally, we outline the prospects of IoT development.
Machine Learning for Real-World:
Is Deep Learning Enough? 


Adil Khan 


Department of Computer Science, Innopolis University, Innopolis, 
Respublika Tatarstan, Russian Federation 
a.khan@innopolis.ru 


The real world is unpredictable; it is full of noise and filled with novel scenarios. 
Therefore, it is almost impossible to provide the machine learning models with a 
complete representation of such a world at the time of training, forcing us to work with 
an insufficient picture of our world. That is why one of the biggest problems that the 
field of machine learning still faces today is the inability of the learned models to 
generalize well to scenarios that are different from the ones seen at the time of training. 
In this talk, we will explore this problem in four different areas: Human Activity 
Recognition, Emotion Recognition in Text, User Authentication, and Medical Image 
Analysis. We will see the results of applying deep learning models to these problems, 
and discuss ways that may be used in addition to the use of such models to achieve 
optimum performance. 
An Improved MST Clustering Algorithm
Based on Membrane Computing 


Ping Gong and Xiyu Liu 


School of Management Science and Engineering, 
Shandong Normal University, Jinan 250014, Shandong, China 
gongping sd@163.com, sdxyliu@163.com


Abstract. The MST clustering algorithm can detect data clusters with irregular
boundaries. For a weighted complete graph, the feasible solutions to the MST
problem are not unique. Membrane computing, known for its characteristics of
distribution and maximal parallelism, can reduce the complexity of processing an
MST of a graph. This paper combines the MST clustering algorithm and membrane
computing by designing a specific P system. The designed P system realizes an
improved MST clustering process by collecting all feasible solutions to the MST
problem, preserving proper edges and deleting redundant heavy edges. The improved
MST clustering method enhances the quality of clustering and is shown to be
feasible through an instance.


Keywords: MST clustering algorithm · Membrane computing · P system


1 Introduction 


Clustering is the process of partitioning a set of data objects into subsets, making each 
subset a cluster, such that objects in identical clusters are similar to each other, yet 
dissimilar to objects in other clusters. Clustering has been widely used in many 
applications such as business intelligence, image pattern recognition, Web search, 
biology, and security [1]. 

Many kinds of algorithms for clustering problems exist in the literature, mainly
including partitioning methods, hierarchical methods, density-based methods, and
grid-based methods. Among these, the minimum spanning tree (MST) clustering algorithm
is known to be capable of detecting clusters with irregular boundaries, because it does
not assume a spherically shaped clustering structure of the underlying data. The MST
construction problem has been investigated by researchers since 1926, which makes it a
relatively mature topic. Many studies of this problem have achieved close to linear time
complexity for the construction [2, 3]. Standard MST clustering algorithms basically
consist of sorting the edges of the constructed graph and removing the edges with the
heaviest weights. Applying MST algorithms to clustering problems has been investigated,
but in practical applications the clustering results of MST algorithms are easily
affected by outliers [3].


© Springer International Publishing AG 2018 
Q. Zu and B. Hu (Eds.): HCC 2017, LNCS 10745, pp. 1-12, 2018. 


Membrane computing is a new branch of natural computing, which has been
investigated and proved to be universal and efficient. Membrane computing models
(named P systems) are motivated by the parallel character of the biochemical reactions
taking place in a series of regions of a living cell. This computing theory is also
motivated by the mathematical convenience of this kind of parallelism. P systems have
the characteristics of distribution and maximal parallelism [4]. Thus, membrane
computing approaches are well suited to combinatorial problems, graph theory, and
finite state problems.

On account of the non-uniqueness of feasible solutions to the MST problem and the
lack of research on MST clustering in the membrane computing field, this paper
combines membrane structures with the MST clustering algorithm. The P system serves
as a computing tool for solving the MST clustering problem, with the aim of reducing
the time complexity. We expect this combination to improve both the quality and the
efficiency of finding the best clustering results.


2 Preliminary 


2.1 P System 


Membrane computing, introduced by Păun in 1998, views the living cell as a
hierarchical structure of regions surrounded by so-called membranes. Three main
variants of P systems have been investigated: cell-like, tissue-like, and neural-like.
A cell-like P system imitates the function and structure of the cell, and its basic
elements are the membrane structure, rules, and objects. Cell-like arrangements of
membranes correspond to trees. P systems have been proposed to solve problems in
computer science, such as NP problems, arithmetic operations, matrix-vector
computation, and image processing. P systems have also proved effective when applied
to clustering problems [6].

Membranes divide the whole system into different regions. The skin membrane is
the outermost membrane. A membrane is an elementary membrane if it contains no other
membranes, and a non-elementary membrane otherwise. Rules and objects exist in
regions. Objects are usually represented by strings or characters. Rules are used to
process the objects or membranes in the corresponding region. The rules are executed
nondeterministically and in a maximally parallel manner. Cell-like P systems can be
further divided into three types according to the kinds of rules: transition P systems,
P systems with communication rules, and P systems with active membranes [7]. The basic
membrane structure is shown in Fig. 1.

In general, a P system of degree m is a construct

$$\Pi = (V, T, C, H, \mu, w_1, w_2, \ldots, w_m, (R_1, \rho_1), (R_2, \rho_2), \ldots, (R_m, \rho_m))$$

where:

1. $V$ is an alphabet; its elements are called objects;
2. $T \subseteq V$ is the set of output objects;
3. $C \subseteq V - T$ is the set of catalysts. Catalysts change neither in number nor in
   kind when rules are applied, but the rules that involve them cannot be executed
   without them;
4. $H = \{1, 2, \ldots, m\}$ is the set of membrane labels;
5. $\mu$ is the membrane structure; each membrane has its label;
6. $w_i$ ($i = 1, 2, \ldots, m$) is the objects in membrane $i$;
7. The basic rule has the form $u \to v$, where $u$ is a string composed of objects in $V$
   and $v$ is a string of the form $v = v'$ or $v = v'\delta$; $v'$ is a string over
   $\{a_{here}, a_{out}, a_{in_j} \mid a \in V, 1 \le j \le m\}$;
8. $R_i$ is the set of rules in region $i$, and $\rho_i$ is the precedence relation that
   defines a partial order over $R_i$; a rule with higher priority is executed first [8].

Fig. 1. The basic membrane structure (labeled parts: skin, elementary membrane, membranes, regions, environment)


During a computation, the rules in each membrane are applied nondeterministically and
in a maximally parallel manner, so an exponential workspace can be generated in a
linear number of steps. This is very helpful for solving computationally hard problems
in feasible time. The P system halts when no more rules can be executed, and the
objects in the output membrane are the final result. If rules can always be executed,
the P system never halts; such a computation is invalid and produces no output [9].
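As a minimal illustration of why maximal parallelism can produce an exponential workspace in a linear number of steps, the sketch below simulates a single region with a hypothetical rule a → aa, which is not one of the rules of this paper. The function name `step` and the dictionary encoding of rules are assumptions made for the illustration, and only rules with a single object on the left-hand side are supported.

```python
from collections import Counter


def step(objects, rules):
    """Apply single-object rewriting rules in a maximally parallel way.

    objects: Counter mapping an object name to its multiplicity in the region.
    rules:   dict mapping one object to the list of objects it produces.
    Every copy that has a rule is rewritten in the same step; objects
    without a rule are carried over unchanged.
    """
    result = Counter()
    for obj, count in objects.items():
        if obj in rules:
            for produced in rules[obj]:
                result[produced] += count
        else:
            result[obj] += count
    return result


# Hypothetical rule a -> aa: the number of copies doubles at each step.
rules = {"a": ["a", "a"]}
region = Counter({"a": 1})
history = [region["a"]]
for _ in range(5):
    region = step(region, rules)
    history.append(region["a"])

print(history)  # [1, 2, 4, 8, 16, 32]
```

After five steps the region holds 32 copies of a. Since this rule is always applicable, the system would never halt on its own, which is why the loop stops after a fixed number of steps.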


2.2 MST Clustering Method 


A tree is a simple structure for representing binary relationships, and any connected
component of a tree is called a sub-tree. A spanning tree is an acyclic connected
sub-graph of a graph G that contains all the vertices of G, and the minimum
spanning tree (MST) is the spanning tree with the minimum total weight. By representing
the data set as a graph and finding the corresponding minimum spanning tree, the
structural characteristics of the data can be captured to some extent.




MST-related theories have been widely used for data classification in the fields of
pattern recognition and image processing for about forty years. As classical algorithms
rely on either the idea of grouping data around some 'centers' or the idea of separating
data points using some regular geometric curve like a hyper-plane, they generally do
not work well when the boundaries of the clusters are very complex. An MST is quite
invariant to detailed geometric changes in the boundaries of clusters. As long as the
relative distances between clusters do not change significantly, the shape complexity of
a cluster has very little effect on the performance of MST-based clustering algorithms.
The process is also quite simple: remove the k-1 heaviest edges from the constructed
minimum spanning tree, and the k sub-trees obtained can be regarded as k clusters [10].
Two key advantages of representing a set of multi-dimensional data as an MST are:
(1) the simple structure of a tree facilitates efficient implementations of rigorous
clustering algorithms, which are otherwise computationally very challenging; and
(2) as MST-based clustering does not depend on the detailed geometric shape of a
cluster, it can overcome many of the problems faced by classical clustering algorithms.
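The cut-based procedure described above can be sketched as follows. This is a minimal illustration and not the P system of this paper: it builds the MST with Kruskal's algorithm (a choice made here for brevity), uses one-dimensional points with absolute differences as edge weights, and the names `DisjointSet` and `mst_clusters` are hypothetical.

```python
class DisjointSet:
    """Union-find over vertices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, u, v):
        """Merge the sets of u and v; return False if they were already merged."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        self.parent[ru] = rv
        return True


def mst_clusters(points, k):
    """Cluster 1-D points by cutting the k-1 heaviest MST edges."""
    n = len(points)
    edges = sorted((abs(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    # Kruskal: scan edges by increasing weight, skip those closing a cycle.
    forest = DisjointSet(n)
    mst = [(w, i, j) for w, i, j in edges if forest.union(i, j)]
    # mst is sorted by weight, so the k-1 heaviest edges are at the end.
    kept = mst[: n - k]
    clusters = DisjointSet(n)
    for _, i, j in kept:
        clusters.union(i, j)
    groups = {}
    for v in range(n):
        groups.setdefault(clusters.find(v), []).append(points[v])
    return sorted(groups.values())


print(mst_clusters([0, 1, 2, 10, 11, 20], k=3))
# [[0, 1, 2], [10, 11], [20]]
```

The two heaviest MST edges (weights 8 and 9) bridge the gaps between the groups, so removing them leaves the three tight groups as clusters.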

Representing a set of multi-dimensional data points as a simple tree structure loses
some of the inter-data relationships. However, when the data set is simplified into an
MST, the essential intra-data information is retained in the tree. Each cluster
corresponds to one sub-tree, which does not overlap the sub-tree representing any
other cluster. Through the MST representation, we can therefore convert a
multi-dimensional clustering problem into a tree partitioning problem, namely finding a
particular set of tree edges and cutting them. For such a combinatorial optimization
problem, finding a globally optimal solution is often possible.


3 A P System for Improved MST Clustering Method 


3.1 The Improved MST Clustering Method 


MST clustering methods represent a data set as a tree with the minimum total edge
weight and then cut its k-1 longest edges to form k clusters. In this section, we
propose a new method to find a minimum spanning tree, and implement the MST
construction rules in a membrane system. By applying the membrane structure as a
computing tool, we can find all feasible solutions to an MST problem in the process.
Finally, we optimize the process of cutting the k-1 longest edges to find a better
clustering solution.

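Firstly, we measure the similarity between data points as the distance between objects,
and express it in a matrix $W'_{n \times n}$:

$$W'_{n \times n} = \begin{pmatrix}
w'_{11} & w'_{12} & \cdots & w'_{1n} \\
w'_{21} & w'_{22} & \cdots & w'_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
w'_{n1} & w'_{n2} & \cdots & w'_{nn}
\end{pmatrix}$$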

where $w'_{ij}$ is the distance between points $a_i$ and $a_j$. To facilitate calculation,
$W'_{n \times n}$ is converted into the corresponding integer matrix $W_{n \times n}$ as follows:

$$W_{n \times n} = \begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1n} \\
w_{21} & w_{22} & \cdots & w_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n1} & w_{n2} & \cdots & w_{nn}
\end{pmatrix}$$

Accordingly, every $w_{ij}$ here is the weight of the edge $e_{ij}$ connecting vertices $a_i$ and $a_j$.

The first part of the MST clustering process is to construct minimum spanning trees.
Let V be the vertex set and E the edge set, both initially empty, and rank all edges in
ascending order of weight. First, the shortest edge $e_{ij}$ (the one with the smallest
weight) is added to E, and its two end vertices $a_i$ and $a_j$ are added to V. The other
edges are then added to E one at a time: each new edge must be connected to one of the
vertices already in V and have the smallest weight among the remaining edges. After an
edge is added to E, the other vertex connected to it is added to V. If that vertex is
already in V, the edge would close a cycle in the tree, so the edge is abandoned. These
steps are repeated until |V| = n and |E| = n-1 (|V| is the number of vertices in V;
|E| is the number of edges in E).
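The sequential construction just described can be sketched in a few lines. This is an illustrative reading of the text, not the membrane rules themselves: the function name `grow_spanning_tree` and the 4-vertex weight matrix are assumptions, and ties are broken by taking the first minimum rather than branching.

```python
def grow_spanning_tree(weights):
    """Grow a minimum spanning tree from a symmetric weight matrix."""
    n = len(weights)
    # Start from the lightest edge of the complete graph.
    start = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                key=lambda e: weights[e[0]][e[1]])
    vertices = set(start)
    tree = [start]
    while len(vertices) < n:
        # Candidate edges have exactly one endpoint in the tree, so an edge
        # whose two endpoints are already present (a cycle) is never chosen.
        u, v = min(((a, b) for a in sorted(vertices)
                    for b in range(n) if b not in vertices),
                   key=lambda e: weights[e[0]][e[1]])
        vertices.add(v)
        tree.append((u, v))
    return tree


W = [[0, 1, 4, 3],
     [1, 0, 2, 5],
     [4, 2, 0, 6],
     [3, 5, 6, 0]]
tree = grow_spanning_tree(W)
print(tree)                           # [(0, 1), (1, 2), (0, 3)]
print(sum(W[u][v] for u, v in tree))  # 6
```

Restricting the candidates to edges that leave the current vertex set has the same effect as adding an edge and then abandoning it when both of its end vertices are already in V.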

For a complete graph, especially one with a large number of vertices, some edges have
identical weights. Therefore, the minimum spanning tree of a given complete graph is
not necessarily unique: spanning trees of different shapes can have the same minimum
total weight. To find all feasible solutions, every edge among those with the same
weight has to be tried in turn each time a choice arises while constructing an MST.
Such an enumeration process obviously takes too much time.
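To make the non-uniqueness concrete, the following brute-force sketch enumerates every spanning tree of a small complete graph and keeps those of minimum total weight. It is only meant for tiny graphs, since the number of candidate edge subsets grows combinatorially; the function names and the 4-vertex example are assumptions made for the illustration.

```python
from itertools import combinations


def is_spanning_tree(n, edges):
    """True if the n-1 given edges connect vertices 0..n-1 without a cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:  # the edge would close a cycle
            return False
        parent[ru] = rv
    return True


def all_minimum_spanning_trees(weights):
    """Return (minimum weight, list of all spanning trees with that weight)."""
    n = len(weights)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    trees = [t for t in combinations(edges, n - 1) if is_spanning_tree(n, t)]

    def cost(t):
        return sum(weights[i][j] for i, j in t)

    best = min(cost(t) for t in trees)
    return best, [t for t in trees if cost(t) == best]


# Four vertices on a cycle of weight-1 edges, with weight-2 diagonals.
W = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
best, trees = all_minimum_spanning_trees(W)
print(best, len(trees))  # 3 4
```

Any three of the four weight-1 edges form a spanning tree, so this graph has four distinct minimum spanning trees of weight 3.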

One of the main advantages of membrane computing is its parallelism. The membrane
structure is hierarchical, and the computing rules in the regions evolve, react, and
communicate synchronously. We therefore apply this parallelism to construct all minimum
spanning trees of a complete graph at the same time, by duplicating reaction membranes
and generating new ones together with their rules. When more than one edge can be
chosen, we construct new computing regions, one for each feasible new edge to be added
to the tree.

When the process of adding edges and vertices to the corresponding sets ends, the next
step is to partition the produced MST. An edge with a large weight means that the two
vertices it connects are distinct from each other. For this reason, we delete the k-1
edges with maximum weight from the existing set E to form k clusters. Since more than
one feasible MST solution has been collected, we can conclude that the short edges that
always appear are correct elements of sub-trees and should be preserved in the
partition. We rearrange the edges that belonged to set E and now appear in different
membranes as parts of MST solutions in descending order of their frequency of
occurrence. Then we delete redundant edges one by one. Among edges with the same
weight, we preferentially preserve those with higher frequency; among edges with the
same frequency, we preferentially delete those with heavier weights. To make sure that
every point is kept in set V, the existence of the two adjacent vertices must be checked
whenever an edge is to be deleted. This process is repeated until |E| = n-k.
The procedure of our method is as follows:


Input: G = <V, E>, k, n
Output: k clusters
Begin:
    Sort the edges by their weights w_ij in ascending order.
    Select the shortest edge e_ij initially.
    If (the shortest e_ij is not unique) {
        Set every shortest edge as an initial edge in a different computation branch.
    }
    Add e_ij into E. Add the connected a_i, a_j into V.
    For (e_ij is the next shortest edge) {
        If (a_i is in V and a_j is not in V) {
            Add e_ij into E. Add a_j into V.
        } else if (a_j is in V and a_i is not in V) {
            Add e_ij into E. Add a_i into V.
        } else {
            Give up e_ij.  // to make sure the tree is connected and has no cycle
        }
    }
    If (|V| ≠ n) { Give up this computation branch. }
    Count the frequency f_ij of the edges across the different computation branches.
    Sort the edges by their frequency in descending order and keep the top n.
    For (e_ij) {
        If (f_ij ≤ f_pq and w_ij > w_pq) { Give up e_ij. }
    } Until |E| = n-k.
End
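The final pruning stage of the procedure can be sketched as below, under one possible reading of the text: edges are ranked by how often they occur across the collected MSTs, the n-1 most frequent are kept as candidates, and the heaviest candidates are then removed (less frequent ones first among equal weights) until n-k edges remain. The function names, the tie-breaking order, and the small input are assumptions, and the check that every vertex remains covered is omitted for brevity.

```python
from collections import Counter


def prune_by_frequency(msts, weight, n, k):
    """Keep n-k edges, preferring frequent and light edges."""
    freq = Counter(edge for tree in msts for edge in tree)
    # Rank by frequency (descending), then by weight (ascending).
    ranked = sorted(freq, key=lambda e: (-freq[e], weight[e], e))
    candidates = ranked[: n - 1]
    # Order by weight so that the heaviest edges sit at the end; among equal
    # weights the less frequent edge is placed later and is dropped first.
    candidates.sort(key=lambda e: (weight[e], -freq[e], e))
    return candidates[: n - k]


def components(n, edges):
    """Connected components of vertices 0..n-1 under the given edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Two hypothetical MSTs of a 5-vertex graph that differ only in the bridge edge.
msts = [
    [(0, 1), (1, 2), (3, 4), (2, 3)],
    [(0, 1), (1, 2), (3, 4), (1, 3)],
]
weight = {(0, 1): 1, (1, 2): 1, (3, 4): 1, (2, 3): 5, (1, 3): 5}
kept = prune_by_frequency(msts, weight, n=5, k=2)
print(kept)                 # [(0, 1), (1, 2), (3, 4)]
print(components(5, kept))  # [[0, 1, 2], [3, 4]]
```

The three edges that occur in both trees survive, while the alternative bridge edges, which are both heavy and infrequent, are removed.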


3.2 Definition of the P System for Improved MST Clustering Method 


P systems deal with multisets of symbol objects in a distributed and parallel manner.
We therefore design a specific P system for the improved MST clustering method to
collect all feasible MST solutions and partition the data set. The designed P system is
shown in Fig. 2.

Fig. 2. The initial state of the designed P system


$$\Pi = (O, \mu, M_0, M_1, M_{i_0}, R_0, R_1, R_{i_0}, \rho)$$


Where: 

$O = \{\varepsilon^{n-1}, a_1, a_2, \ldots, a_n, \theta_1, \lambda\}$ specifies the initial collection of objects in the P system;
$\mu = [_0\,[_1\,]_1\,[_{i_0}\,]_{i_0}\,]_0$ specifies the initial membrane structure of the P system;

$M_0 = \{\varepsilon^{n-1} a_1 a_2 \ldots a_n\}$ specifies the initial objects in membrane 0;

$M_1 = \{\theta_1\}$ specifies the initial object in membrane 1;

$M_{i_0} = \{\lambda\}$ specifies that there is nothing in the output membrane $i_0$ initially;
Rule in Ro: 


rı = {eyaia; > c+) 414i [Ui lim l1 SbF Sn} U 


ij 
lanin Crne ll Sba n U aan T A] 


Rules in R: 
r = {01 U; > 0U |L<i,j<nj 


n= {CO; Ui, —>> GiGjViV;P jO2Ac|t = 0, 1 < LJ n} 
U {€0,U; it Ui, = [qiq Vi, Vj, Pj Ac Oo], ` + 14:4), Vi, Vj, Pij AcP2| | 
It = 0, 1 ry 


r4 = {602q;U;," — segu" <i,s<n}U 
{CO2qjU,;" > CO2q;U,; O ap <s,j <n} 


i= {COovivsU;, — COzviv,\t = 0,1 <i,s <n} 
_ CO2Vv,U; T qsViVs@sPish2 t = 0, l < 1,8 <nyU 
s 


l 


{C02v;Uy = AsVsVjOsPi0o|t =NA RA S,J< n} 


r] = {02v o;PisAcl E ldsVsPisAcO2| , [A-026], |1 < i,j < n} U 
{02 vs@;Psj ‘A sie ldsVsP Ac 02], [A-026], |1 < i,j < n} 


rg ={0. > d'} rm = {qiviUj > Ala} ro = fP; = InP, la } 
ri = {d > ô} m={€—-d} rg = {Ae Ala} 
ra = {d —> ô} 


Rules in Rp: 


a. t pmı pm Mr t—-1 pm-l pm-l1 ._.. m,—1 
V5 Da {n ant = n E E E; | 


te N*,m €N*,m,>m>--- >m,,1<i,j<n} 


/ 
SE t—mMn (m) p” 2) (mn-1) Wiji pW i2ia ae 
eee — 
= {n a 2 i2j2 K —1jn-1 een Do ~ ap? Jat 


1<t,m€N* ,m>m> ie 2 Nip Veen} 


-1 -1 


Wiji pW iri2 in—1jn—1 Wiji Tt pizh T iz Ue 1 
Poe. Pp — PP. P. pe 





iji izj? * in—1Jn—1 tJi i2j2 in—1Jn— 
|1 Sioje S n} 
d ! ! 
- {P Wisijs ) (Wisy joy ) _ P Rata: (Wisisn y ) mre 
iji LJ lsJs In—1Jn—1 
/ 
w, —1 2 ty SVs za = 
n P. ( is} Jy ) piai ) I (Wisst) 5 
2 ls] iji lJa S In—1Jn-1 
n—k psn- a 
r19 7 {m Pi ia ARE i Te ae I is, Isp 1 =F 
it UE ee wpe. a <í < 
Us) Jsq Pi is ee l a İs, Js Ta nj 


By counting the number of objects to find the shortest edges, the sorting procedure
is combined with the tree construction. Meanwhile, the parallelism of the computation
reduces the time complexity of finding all feasible solutions to O(n).


3.3 Introduction of Computations 


Rule $r_1$ is executed first, in the environment, to calculate the direct distance between
any two points and send it to membrane 1. After the objects $u_{ij}$ are sent into
membrane 1, rules $r_2$ and $r_3$ are activated to identify the shortest edge among all
edges and to generate the two related points from these $u_{ij}$. If the number of edges
with minimum weight in the complete graph is greater than 1, rule $r_3$ generates a new
copy of membrane 1 for every initial minimum edge to continue the next steps.
Rule $r_4$ repeatedly identifies the edges that are connected to the existing points and
determines which one has the minimum weight. When $u_{ij} = 0$ appears, the next edge
with minimum weight has been found. When a minimum-weight edge is added, two identical
objects $v_i$ may appear in the same membrane, which means that there is a cycle in the
minimum spanning tree. Rule $r_5$ is activated to dissolve the corresponding $u_{ij}$ to
avoid forming a cycle. Then rule $r_6$ generates the objects that represent the edge and
the point.

If the number of current edges with minimum weight is greater than 1, rule $r_7$
makes the current membrane divide into two: one containing all elements including the
newly added edge, and the other containing all elements except the new edge.

The number of objects $\varepsilon$ is initially n-1, and it decreases by 1 each time an
edge is added. When there is no $\varepsilon$ left in membrane 1, n-1 edges have been added
to the graph. Object $\theta_2$ then generates object $d'$ without the constraint of
$\varepsilon$; $d'$ catalyzes all objects other than $P_{ij}$ to dissolve themselves, and also
promotes the sending of $P_{ij}$ into membrane $i_0$. If objects $\varepsilon$ remain but no
other edges can be added, rules $r_{12}$, $r_{13}$, and $r_{14}$ dissolve the current membrane
and the other objects.

The second computing stage of this structure happens in membrane $i_0$. Rules $r_8$ to
$r_{11}$ send the edges chosen for a minimum spanning tree into membrane $i_0$. Each copy
of membrane 1 that executes rules $r_8$ to $r_{11}$ holds one feasible solution to the
minimum spanning tree problem. As a consequence, more than n-1 edges, including
identical ones, are sent into membrane $i_0$. Rules $r_{15}$ to $r_{19}$ are executed in
membrane $i_0$. The number of objects $\eta$ indicates the number of edges in membrane
$i_0$. We choose the edges with maximum weight and highest frequency. Rule $r_{18}$
executes until the cardinality of $\eta$ is reduced to n-k (n-1-(k-1) = n-k), which means
that only n-k different edges are left. At this point, rule $r_{19}$ finds the final n-k
edges and preserves them in the result.


4 Instance Analysis 


In order to verify whether the designed P system has a better clustering effect, we
introduce a simple example in this section. In this test, 10 points need to be divided
into 3 clusters, such that data-points in the same cluster are similar to each other but
dissimilar to data-points in other clusters. The example points are shown in Fig. 3.

The modified similarity matrix of this set is as follows.


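$$W_{10 \times 10} = \begin{pmatrix}
0 & 1 & 1 & 8 & 10 & 13 & 10 & 10 & 17 & 25 \\
1 & 0 & 2 & 5 & 5 & 10 & 9 & 13 & 20 & 26 \\
1 & 2 & 0 & 5 & 9 & 8 & 5 & 5 & 10 & 16 \\
8 & 5 & 5 & 0 & 2 & 1 & 2 & 10 & 13 & 13 \\
10 & 5 & 9 & 2 & 0 & 5 & 8 & 20 & 25 & 25 \\
13 & 10 & 8 & 1 & 5 & 0 & 1 & 9 & 10 & 8 \\
10 & 9 & 5 & 2 & 8 & 1 & 0 & 4 & 5 & 5 \\
10 & 13 & 5 & 10 & 20 & 9 & 4 & 0 & 1 & 5 \\
17 & 20 & 10 & 13 & 25 & 10 & 5 & 1 & 0 & 2 \\
25 & 26 & 16 & 13 & 25 & 8 & 5 & 5 & 2 & 0
\end{pmatrix}$$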
Fig. 3. Instance data set (a1 = (1,2), a2 = (1,3), a3 = (2,2), a4 = (3,4), a5 = (2,5), a6 = (4,4),
a7 = (4,3), a8 = (4,1), a9 = (5,1), a10 = (6,2)).
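The entries of the similarity matrix above agree with the squared Euclidean distances between the points listed in Fig. 3. This is an observation drawn from the data rather than a statement made in the text, and the sketch below recomputes the matrix under that assumption; the function name `squared_distance_matrix` is hypothetical.

```python
# Coordinates of a1..a10 as listed in the caption of Fig. 3.
points = [(1, 2), (1, 3), (2, 2), (3, 4), (2, 5),
          (4, 4), (4, 3), (4, 1), (5, 1), (6, 2)]


def squared_distance_matrix(pts):
    """Integer matrix of squared Euclidean distances between all point pairs."""
    return [[(x1 - x2) ** 2 + (y1 - y2) ** 2 for (x2, y2) in pts]
            for (x1, y1) in pts]


W = squared_distance_matrix(points)
print(W[0])  # [0, 1, 1, 8, 10, 13, 10, 10, 17, 25]
```

Squared distances keep every weight an integer for integer coordinates, which fits the conversion to an integer matrix described in Sect. 3.1.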


The designed P system realizes the minimum spanning tree clustering algorithm with its
maximal parallelism and distributed manner. We now give a simple analysis of how the
P system executes in this example.

At first, rule $r_1$ in the environment sends objects $u_{ij}$ ($1 \le i, j \le 10$) into
membrane 1 to represent the distances between data-points. Among all these $u_{ij}$,
rules $r_2$ and $r_3$ pick $u_{12}$, $u_{13}$, $u_{46}$, and $u_{67}$ to start the construction,
because they have the minimum weight 1. Therefore four copies of membrane 1 continue
the next construction steps in parallel, each starting from a different edge. For the
membrane 1 starting from $u_{12}$, the edge with the next minimum weight is $e_{13}$. Rule
$r_5$ then determines whether adding $e_{13}$ would form a cycle. If a cycle would form,
the edge is abandoned; otherwise, the edge is added. After this check, $u_{13}$ evolves
into $q_3 v_3 \omega_3 P_{13}$ according to rule $r_6$. The next object to be checked is
$u_{23}$. Since $v_1$, $v_2$, and $v_3$ are already present, rule $r_5$ dissolves $u_{23}$, and
the process continues with the edges of the next minimum weight 5: $e_{24}$, $e_{25}$,
$e_{34}$, $e_{27}$, and $e_{38}$. As more than one object can be added, the previous objects
in the membrane are copied 5 times and membrane 1 splits into 5 identical membranes
according to rule $r_7$. For example, in the one containing $e_{25}$ the next minimum edge
to be added is $e_{54}$, and in the one containing $e_{34}$ the edge $e_{46}$ with weight 1 is
the next to be added. These procedures repeat until no $\varepsilon$ or $u_{ij}$ is left.
Finally, 26 minimum spanning trees are constructed, and according to rule $r_{19}$ the
edges of these trees are sent into membrane $i_0$. The membrane structure containing all
feasible solutions is shown in Fig. 4.

The final partition step is executed in membrane $i_0$ with rules $r_{15}$ to $r_{19}$. In
this example, the edges in all feasible minimum spanning tree solutions are $e_{12}$,
$e_{13}$, $e_{46}$, $e_{78}$, $e_{89}$, $e_{9,10}$, $e_{67}$, $e_{45}$, $e_{24}$, $e_{25}$, $e_{34}$,
$e_{38}$, $e_{56}$, and $e_{37}$. They are sorted by the rules in descending order of their
cardinality, and only the first 9 (n-1) edges are retained by rules $r_{15}$ and $r_{16}$.
Rules $r_{17}$, $r_{18}$, and $r_{19}$ pick the 2 (k-1) longest edges, $e_{78}$ and $e_{24}$, and
delete them according to their weights. Finally, edges $e_{12}$, $e_{13}$, $e_{46}$, $e_{89}$,
$e_{9,10}$, $e_{67}$, and $e_{45}$ are preserved in the graph. The clustering result is
$\{a_1, a_2, a_3\}$, $\{a_4, a_5, a_6, a_7\}$, $\{a_8, a_9, a_{10}\}$.
Fig. 4. Configuration after applying rule $r_{10}$





5 Discussion and Future Works 


In this paper, we construct a specific P system in which all feasible solutions to the
minimum spanning tree problem of a complete graph can be calculated. Then, by
integrating these feasible solutions and deleting some heavy edges, the final partition
result is formed in the output membrane. The example shows that the designed P system
can effectively find a quality clustering result in a distributed and parallel manner.
Future work will mainly focus on optimizing the process of finding solutions, with the
purpose of reducing identical solutions and computational complexity and making the
system run more efficiently.
Abstract. In mobile augmented reality applications, the AR two-dimensional codes
captured by a mobile device are inevitably tilted. A tilt correction method for
two-dimensional codes based on the least squares method is proposed. After
binarization preprocessing, the inclination angle of the target two-dimensional code
is obtained by the least squares method. The image is then restored by a further
correction with bilinear interpolation, which gives satisfactory results. The
experimental results show that the method can quickly and effectively identify the
target two-dimensional code and then display the corresponding three-dimensional
model in augmented reality.


Keywords: Augmented reality · 2D code · Tilt correction · Least square method


1 Introduction 


Augmented reality (AR) is a new kind of computer application and human-computer
interaction technology that has emerged with the development of virtual reality
technology [1]. It combines the computer-generated virtual environment with the real
scene of the user by means of photoelectric display technology, interactive technology,
multi-sensor technology, computer graphics and multimedia technology, so that the user
perceives the virtual environment as an integral part of the surrounding real scene.
Unlike immersive virtual reality, augmented reality technology is mainly based on the
existing real world; it provides users with a new composite visual effect and expands
human cognition and perception of the world. Augmented reality technology not only
offers a strong sense of reality and a small modeling workload, but is also more secure
and reliable.

Mobile augmented reality applications rely on the software and hardware of the mobile
device: the device camera captures real-world images, the relevant information is
calculated, the virtual scene is integrated, and the result is finally output to the
screen, a projector, or another display device [4]. When a two-dimensional barcode image
is acquired by the camera of a mobile phone for decoding, the acquired two-dimensional
code is inevitably tilted. When the tilt angle exceeds a certain range, the barcode
cannot be decoded correctly. Therefore, it is necessary to correct the tilt of the
two-dimensional code.


Traditional two-dimensional code correction algorithms include the Hough transform and
the Fourier transform [6]. The Fourier transform method takes the tilt angle of the
two-dimensional barcode to be the azimuth angle with the largest density in Fourier
space. Its computational complexity is very high, and it is rarely used at present. The
Hough transform is the most commonly used method for detecting the inclination angle.
Because unpredictable errors may appear when a mobile phone displays augmented 3D
models in real time, the approach in this paper first records the spatial coordinates of
the two-dimensional code contour and then applies the least squares method (LSM): a
linear regression of these points gives the slope of the line, from which the tilt of the
two-dimensional code is known. This can effectively reduce the computation time.
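The regression step mentioned above can be sketched as follows. This is a minimal illustration, assuming the contour points of one edge of the code are already available as (x, y) pairs; the function name `tilt_angle_degrees` and the synthetic points are hypothetical, and the closed-form slope is the standard ordinary least squares estimate for a line y = m·x + b.

```python
import math


def tilt_angle_degrees(points):
    """Angle (in degrees) of the least-squares line y = m*x + b through points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        return 90.0  # all x values are equal: the fitted edge is vertical
    slope = (n * sxy - sx * sy) / denom
    return math.degrees(math.atan(slope))


# Hypothetical contour points sampled along one edge of a code tilted by 15 degrees.
m = math.tan(math.radians(15))
edge = [(x, m * x + 3.0) for x in range(0, 50, 5)]
print(round(tilt_angle_degrees(edge), 3))  # 15.0
```

Once the angle is known, the image can be rotated back by the same amount, for example with bilinear interpolation as mentioned in the abstract.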


2 Overview of 2D Code 


As a new information storage and transmission technology, the two-dimensional code can
express text and image information in many languages, and has the characteristics of
high capacity, high density, strong error correction capability, fast positioning, and
automatic encoding and decoding [3]. It can not only encode the scene information
associated with a landmark into that landmark, but also recover the scene information by
decoding the landmark, which is clearly different from the traditional ARToolKit marker
points used for locating and recognizing [7]. A common mark point is the QR code, as
Fig. 1 shows. As a kind of two-dimensional code, the QR code is based on computer image
processing technology combined with coding principles, and supports automatic
recognition and reading.





Fig. 1. Two kinds of mark points: (a) ARToolKit marker; (b) QR code mark point


2.1 QR Code Internal Structure 


The QR code is a matrix-type two-dimensional code. Its square symbol consists of a coding region (version information, format information, and data and error correction codewords) and function patterns (the finder pattern, the separator, and the correction pattern) [5]. The coding region holds the data and error correction codewords. The function patterns are specific patterns in the symbol used for symbol localization and feature recognition; the position detection patterns located at three corners help determine the position, size, and inclination of the symbol. The symbol is surrounded by a blank area. Figure 2 shows the internal structure of the QR code. In the symbol, a dark module represents binary “1” and a light module represents binary “0” [9].

Fig. 2. QR code structure (labeled regions: version information, blank area, position mark point, separator, format information, correction codewords)


3 The Tilt Correction Principle Based on LSM


According to the structure of the QR code, we first locate the position mark and record the spatial coordinates of its left boundary to construct the data set {X, Y}, where X and Y are the input vectors that store the spatial coordinates of the boundary points of the contour [2].

After rotation, the left edge of the 2D barcode must be perpendicular to the X axis, so that:


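$$\min \sum_{i} \left( \omega(x_i, y_i) - \omega(x, y) \right)^2 \tag{1}$$

A linear regression of the form f(x) = ωx + b can be fitted to the constructed data set {X, Y}. The regression error of sample point i is:

$$e_i = f(x_i) - Y_i = \omega x_i + b - Y_i \tag{2}$$

Combining Eq. 1 and Eq. 2 gives:

$$\min \sum_{i} \left( \omega(x_i, y_i) - \omega(x, y) \right)^2 = \min \sum_{i} \left( (\omega x_i + b - Y_i) - (\omega x + b - Y) \right)^2 = \min \sum_{i} |e_i|^2 \tag{3}$$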
Therefore, it suffices to perform a regression on the data set {X, Y}. When the variance of the error $e_i$ is minimal, the parameter ω is the tilt parameter. In this way, the optimization problem for the image tilt angle is transformed into a regression on the data set {X, Y} that identifies the parameter ω.

Suppose there is a univariate regression model f(x) = ax + b, and the regression of the random variable Y on the independent variable X is f(x). The univariate regression model is:

$$y = ax + b + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Here the parameters a and b do not depend on the independent variable X.

Let the left boundary of the contour of a two-dimensional code consist of N mutually independent feature points [6]. The deviation between the estimated value and the sample is:

$$\delta_i = y_i - f(x_i) = y_i - a x_i - b$$

By Eq. 3, the tilt is obtained when the deviation is minimal, that is:

$$\min \delta^2 = \min \sum_{i=0}^{N-1} (y_i - a x_i - b)^2 \tag{4}$$


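Setting the partial derivatives of Eq. 4 with respect to a and b to zero (the KKT conditions for this unconstrained problem) gives the optimal values of a and b:

$$a = \frac{N \sum_{i=0}^{N-1} x_i y_i - \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} y_i}{N \sum_{i=0}^{N-1} x_i^2 - \left( \sum_{i=0}^{N-1} x_i \right)^2} \tag{5}$$

$$b = \frac{\sum_{i=0}^{N-1} x_i^2 \sum_{i=0}^{N-1} y_i - \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i y_i}{N \sum_{i=0}^{N-1} x_i^2 - \left( \sum_{i=0}^{N-1} x_i \right)^2} \tag{6}$$

The value of a obtained here is the slope of the straight line fitted to the left edge, from which the inclination angle of the two-dimensional code relative to the horizontal direction can be obtained.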
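The closed-form least-squares solution can be illustrated with a short sketch. It assumes the boundary points are given as (x, y) pairs; the function names `fit_line` and `tilt_angle` are hypothetical and not part of the original method description.

```python
import math

def fit_line(points):
    """Least-squares fit of y = a*x + b using the closed forms of Eqs. (5) and (6)."""
    n = len(points)
    sx = sum(x for x, _ in points)        # sum of x_i
    sy = sum(y for _, y in points)        # sum of y_i
    sxx = sum(x * x for x, _ in points)   # sum of x_i^2
    sxy = sum(x * y for x, y in points)   # sum of x_i * y_i
    denom = n * sxx - sx * sx
    if denom == 0:
        # All x values equal: the fitted line is vertical and the slope is undefined.
        raise ValueError("slope undefined for identical x coordinates")
    a = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return a, b

def tilt_angle(points):
    """Tilt angle in radians, taken as arctan of the fitted slope."""
    a, _ = fit_line(points)
    return math.atan(a)
```

Only four running sums are needed, so the cost grows linearly with the number of boundary points, which is consistent with the computation-time argument made for the LSM.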


4 QR Code Correction Process 


To detect QR codes in mobile augmented reality, the captured 2D code image is first binarized so that later operations are more accurate. The coordinates of the boundary points of the QR code are then obtained, and the final image is produced through distortion correction, tilt correction, and bilinear interpolation mapping. Finally, the 3D model is obtained by decoding the 2D code image and is output to the screen, a projector, or another display device for augmented reality display. The flowchart is shown in Fig. 3.

Fig. 3. Flowchart


4.1 Image Preprocessing 
There are two common formulas for converting a color image into a grayscale image:
Y = 0.299R + 0.587G + 0.114B (7)
Y = (R + G + B)/3 (8)


Here, we instead use the formula Y = (R + G)/2, ignoring the B component, for the grayscale conversion of the two-dimensional code image. Binarization uses an adaptive threshold segmentation algorithm. Its basic idea is to compute the average gray value of S pixels while traversing the image. A pixel whose value is lower than this average is set to black; otherwise it is set to white.
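The preprocessing step can be sketched as follows. The text does not say which S pixels are averaged, so this sketch assumes the S most recent pixels in row-major traversal order, including the current one; the function names and the default `s=8` are hypothetical.

```python
import numpy as np

def to_gray(rgb):
    """Grayscale conversion Y = (R + G) / 2, ignoring the B channel."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return (rgb[..., 0] + rgb[..., 1]) / 2.0

def adaptive_binarize(gray, s=8):
    """Set a pixel to black (0) if it is below the mean of the last s pixels, else white (255).

    Assumption: the window is the s most recent pixels in row-major order,
    so it may span the end of one row and the start of the next.
    """
    gray = np.asarray(gray, dtype=np.float64)
    flat = gray.ravel()
    csum = np.concatenate(([0.0], np.cumsum(flat)))  # prefix sums for fast window means
    idx = np.arange(1, flat.size + 1)
    start = np.maximum(idx - s, 0)
    mean = (csum[idx] - csum[start]) / (idx - start)
    out = np.where(flat < mean, 0, 255).astype(np.uint8)
    return out.reshape(gray.shape)
```

A dark module surrounded by lighter pixels falls below the running mean and is mapped to black, while uniform regions stay white.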
4.2 Barcode Location


After the barcode is binarized, it needs to be located. According to the structure of the QR code, we locate the position marks of the QR code, record their spatial coordinates, and construct the data set {X, Y}, where X and Y are the input variables.


4.3 Distortion Correction 


In practice, a QR code image captured by the camera has some degree of distortion. Any geometric distortion can be described by equations that transform the undistorted coordinate system (x, y) into the distorted coordinate system (x′, y′), generally of the form:

$$\begin{cases} x' = h_1(x, y) \\ y' = h_2(x, y) \end{cases} \tag{9}$$

Let f(x, y) be the original image without distortion and g(x′, y′) be the distorted version of f(x, y). The distortion process is assumed known and is defined by the functions h1(x, y) and h2(x, y):

$$g(x', y') = f(x, y) \tag{10}$$

Equation (10) states that the gray value that should appear at pixel (x, y) appears at (x′, y′) instead because of the distortion, so the distortion can be undone by a mapping transformation. Given g(x′, y′), h1(x, y), and h2(x, y), the restoration process is as follows:

1. For each point (x0, y0) in f(x, y), find the corresponding position in g(x′, y′): (M, N) = [h1(x0, y0), h2(x0, y0)], where M and N are the coordinate values of the spatial point. Since M and N are not necessarily integers, (M, N) generally does not coincide with any pixel in g(x′, y′).

2. Find the pixel (x1′, y1′) in g(x′, y′) nearest to (M, N) and set f(x0, y0) = g(x1′, y1′), that is, assign the gray value of g(x1′, y1′) to f(x0, y0). Proceeding point by point until the entire image is processed yields the geometrically corrected image.
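The two restoration steps above can be sketched as a nearest-neighbor inverse mapping. The distortion functions `h1` and `h2` are passed in as plain Python callables, and leaving out-of-range positions at 0 is an assumption of this sketch, not something the text specifies.

```python
import numpy as np

def restore_nearest(g, h1, h2, out_shape):
    """Rebuild f from the distorted image g by nearest-neighbor lookup.

    g         : 2D array indexed as g[y', x']
    h1, h2    : known distortion functions, x' = h1(x, y), y' = h2(x, y)
    out_shape : (rows, cols) of the restored image f
    """
    rows, cols = out_shape
    f = np.zeros(out_shape, dtype=g.dtype)
    for y0 in range(rows):
        for x0 in range(cols):
            m, n = h1(x0, y0), h2(x0, y0)         # step 1: position (M, N) in g
            x1, y1 = int(round(m)), int(round(n))  # step 2: nearest pixel to (M, N)
            if 0 <= x1 < g.shape[1] and 0 <= y1 < g.shape[0]:
                f[y0, x0] = g[y1, x1]
            # Assumption: positions that map outside g stay at 0.
    return f
```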


4.4 Tilt Correction 


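After the distortion correction, we record the left border of the barcode and use Eqs. (5) and (6) to calculate the tilt angle α = arctan a. Once the tilt angle is obtained, the image is rotated with the rotation formula:

$$\begin{bmatrix} x_{new} \\ y_{new} \end{bmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \begin{bmatrix} x_{old} \\ y_{old} \end{bmatrix} \tag{11}$$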
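As a small illustration of the rotation step, the sketch below rotates a single coordinate pair by the tilt angle. The sign convention (cos α, sin α in the first row and −sin α, cos α in the second) is an assumption, since the signs are not legible in the printed formula; `rotate_point` is a hypothetical name.

```python
import math

def rotate_point(x_old, y_old, alpha):
    """Rotate (x_old, y_old) by alpha radians.

    Assumed convention: [[cos, sin], [-sin, cos]]. Negate alpha for the opposite direction.
    """
    c, s = math.cos(alpha), math.sin(alpha)
    x_new = c * x_old + s * y_old    # two multiplications and one addition
    y_new = -s * x_old + c * y_old   # two multiplications and one addition
    return x_new, y_new
```

Each pixel needs four multiplications and two additions, which matches the 4WH and 2WH operation counts quoted for a W × H image.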
If Eq. (11) is used directly on an image of width W and height H, the rotation requires 4WH multiplications and 2WH additions, which is computationally expensive. In fact, for a two-dimensional code image of height H, the maximum deviation of the projection in the vertical direction is:

$$Y_{max} = H \tan\alpha$$


The deviations for the j-th column and the i-th row in the image are, respectively (Fig. 4):


J 
Ay, = (—~— )x w 
É (a) 


Ax, = ( —— | xH 
Fig. 4. Spatial mapping of control points


4.5 Bilinear Interpolation 


The image rotation transformation may produce points that do not fall on integer positions. A gray-level interpolation algorithm is therefore needed to produce a smooth mapping that maintains continuity and connectivity. Interpolation methods include nearest neighbor interpolation, bilinear interpolation, and higher-order interpolation. Nearest neighbor interpolation is simple, but it produces visibly jagged boundaries. For a QR code, which contains only black and white, bilinear interpolation gives a satisfactory restoration result [8]. The mathematical model is shown in Fig. 5.

Fig. 5. Mathematical model of bilinear interpolation


First, by first-order linear interpolation:

$$f(x, 0) = f(0, 0) + x[f(1, 0) - f(0, 0)]$$

Similarly:

$$f(x, 1) = f(0, 1) + x[f(1, 1) - f(0, 1)]$$

$$f(x, y) = f(x, 0) + y[f(x, 1) - f(x, 0)]$$

Merging the above three formulas:

$$f(x, y) = [f(1, 0) - f(0, 0)]x + [f(0, 1) - f(0, 0)]y + [f(1, 1) + f(0, 0) - f(0, 1) - f(1, 0)]xy + f(0, 0)$$
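The equivalence of the step-by-step and merged forms can be checked with a short sketch. The four corner values are passed as `f00`, `f10`, `f01`, `f11` (for f(0, 0), f(1, 0), f(0, 1), f(1, 1)), and x, y are fractional offsets in [0, 1]; the function names are hypothetical.

```python
def bilinear(f00, f10, f01, f11, x, y):
    """Bilinear interpolation as two linear interpolations in x followed by one in y."""
    fx0 = f00 + x * (f10 - f00)   # f(x, 0)
    fx1 = f01 + x * (f11 - f01)   # f(x, 1)
    return fx0 + y * (fx1 - fx0)  # f(x, y)

def bilinear_merged(f00, f10, f01, f11, x, y):
    """The same value computed from the merged single-formula form."""
    return ((f10 - f00) * x + (f01 - f00) * y
            + (f11 + f00 - f01 - f10) * x * y + f00)
```

At the four corners both functions return the corresponding corner value, and between them the result varies linearly along each axis.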


After this series of operations, the original two-dimensional code image is transformed into a more standard image, as shown in Fig. 6.

Fig. 6. Comparison before and after correction: (a) input image f(x, y); (b) output image g(x′, y′) after the perspective transformation
4.6 Download the 3D Model and Display It


As shown in Fig. 7, the QR code content holds the URL of a soldier model, so the corresponding three-dimensional model can easily be downloaded from the server and displayed in augmented reality through the camera.





Fig. 7. Augmented reality display


5 Conclusion and Outlook 


This paper focuses on a tilt correction method for 2D codes in mobile AR. The LSM effectively improves the decoding time and the decoding success rate. After decoding, the three-dimensional virtual model is downloaded from the server and loaded into memory; the model is then rendered to the phone screen, achieving the augmented reality display for the target QR code. However, this method only supports the identification of a single QR code. The next step is to consider a method that supports recognition of multiple QR codes and to combine it with current high-quality local feature point algorithms to realize fast and effective recognition of multiple QR codes.
Abstract. This article addresses problems that arise in target fusion, such as the large quantity of target track information and unclear sensor batch numbers, and proposes a clustering algorithm based on the Gaussian mixture distribution to partition and associate the batch numbers of all target tracks. By fitting the Gaussian distributions with maximum likelihood estimation and iteratively refining the resulting clusters, the tracking information detected by all radar sensors for each target batch number can be obtained precisely. The algorithm provides fast and precise working conditions for target fusion.


Keywords: Gaussian mixture distribution · Associated operation · Maximum likelihood estimation · Clustering algorithm


1 Introduction 


When a ship formation fights cooperatively, the ships and radars in the fleet report all detected targets to the main ship, and this information affects how precisely the warship's equipment can attack.

Moreover, a target detected by radar only returns its physical information, from which the human-defined target batch number cannot be determined precisely. The current literature on radar target fusion rarely addresses these two issues. To reduce the number of targets and avoid unnecessary redundant target information, the batch number association algorithm in this article uses prototype clustering from ensemble learning to relate target information to target batch numbers.


2 The Data Analysis 


The command system of the ship formation periodically receives signals from the different radars in the formation. Let there be M sensors, indexed by m. Each sensor periodically assembles and sends N pieces of tracking information that it has processed itself, expressed in the unified coordinate system. The set of tracking information batches sent by the m-th sensor is denoted $D_m$, and one batch of tracking information in it is denoted $d_{mn}$, where m = 1, 2, ... indexes the sensor that detected the tracking information and n = 1, 2, ... indexes the detected target.


3 Target Batch Correlation 


Target batch correlation is the basis of target fusion, and accurately partitioning the tracking information is the precondition for an accurate fusion algorithm. In formation combat, it is very difficult to recognize targets without accurately partitioning the large amount of detected tracking information, so the accuracy of the algorithm is very important.

We first define some basic notions. As mentioned earlier, target tracking information includes a large amount of data, such as the sensor's serial number, target nationality, target type, longitude and latitude, height, and so on. Let the number of attributes be k, and let $d_{mn} = \{x_{mn1}, x_{mn2}, \ldots, x_{mnk}\}$ denote the tracking information of the n-th target detected by the m-th radar, as a k-dimensional vector. Each component of $d_{mn}$ is the value of one attribute of that batch of tracking information (for example, the value of the target type attribute may be “the air”), and k is the dimension of the tracking information vector.

The result of batch correlation is a list with one batch number per line. Each batch number corresponds to the set of tracking information for that target detected by the related sensors. These sets neither contain one another nor intersect, and their union is D, the set of all tracking information detected by all radars. We call each such subset a “cluster”. Under this partition, each cluster is the set of tracking information for one target detected by the radar sensors. After the target batches are correlated with the tracking information, the list formed by all clusters is reported to the fusion center for the fusion operation.

Formally, D is divided into h disjoint clusters $\{C_i \mid i = 1, 2, \ldots, h\}$, with $C_i \cap C_j = \emptyset$ for $i \neq j$. Each cluster's marker is set to the target batch number. We use $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ to represent the cluster markers of the tracking information, where the value of each λ is a target batch number.

The above covers the basic notions of the association algorithm. The association algorithm studied in this article uses clustering under unsupervised learning: we only need to divide the set D of all tracking information into multiple disjoint clusters. To do so, however, we need to solve the problem of distance calculation.


3.1 Gaussian Mixture Clustering Algorithm 


A Gaussian distribution is determined by two parameters: the mean vector μ and the covariance matrix G. We first define the Gaussian mixture distribution:

$$p_M(x) = \sum_{i=1}^{h} \alpha_i \, p(x \mid \mu_i, G_i) \tag{3.1.1}$$


The Associated Algorithm of Target Batch Number 25 


Here $p(x \mid \mu_i, G_i)$ is the probability density of the i-th Gaussian distribution. Suppose that all tracking information is generated by the Gaussian mixture distribution: a Gaussian component is first selected according to $\alpha_1, \alpha_2, \ldots, \alpha_h$, and the tracking information is then drawn from the probability density of the selected component. The Gaussian mixture distribution is thus a mixture of multiple Gaussian distributions, where the probability of each component is controlled by its parameter $\alpha_i$. This probability measures how strongly a piece of tracking information is associated with that Gaussian distribution.

Defining a Gaussian mixture distribution therefore requires fixing three parameters: $\alpha_i$, $\mu_i$, and $G_i$. Suppose the correlation of all tracking information has been completed. Determining the target batch number of a piece of tracking information then means determining the marker of the cluster that contains it. Let $z_{mn}$ denote the Gaussian component that generated the tracking information $d_{mn}$; its value ranges from 1 to h, but its specific value is unknown. According to Bayes' theorem, the posterior distribution of $z_{mn}$ is

$$p_M(z_{mn} = i \mid d_{mn}) = \frac{P(z_{mn} = i) \, p_M(d_{mn} \mid z_{mn} = i)}{p_M(d_{mn})} = \frac{\alpha_i \, p(d_{mn} \mid \mu_i, G_i)}{\sum_{l=1}^{h} \alpha_l \, p(d_{mn} \mid \mu_l, G_l)} \tag{3.1.2}$$

This gives the posterior probability that $d_{mn}$ belongs to each tracking information cluster, denoted $\gamma_{mn,i}$. To find the batch number of a piece of target tracking information, we only need to find the component i with the largest posterior probability for $d_{mn}$.
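The posterior computation can be sketched with NumPy. The sketch assumes the tracking information is stacked as rows of a numeric array and that each component has its own full covariance matrix; the function names `gaussian_pdf` and `posterior` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density evaluated at each row of x (shape (n, k))."""
    k = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)  # quadratic form per row
    return np.exp(expo) / norm

def posterior(data, alphas, mus, covs):
    """Posterior probability of every component for every row, following Bayes' theorem."""
    weighted = np.stack(
        [a * gaussian_pdf(data, m, c) for a, m, c in zip(alphas, mus, covs)],
        axis=1,
    )
    return weighted / weighted.sum(axis=1, keepdims=True)
```

Taking `posterior(...).argmax(axis=1)` then yields, for each row, the index of the most probable component, which plays the role of the cluster marker.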


3.2 Maximum Likelihood Estimation 


Index the tracking information in D as $d_j$, j = 1, ..., m (here m is the total number of pieces of tracking information), and write $\gamma_{ji}$ for the posterior probability of Eq. (3.1.2) that $d_j$ belongs to the i-th Gaussian component. Maximum likelihood estimation is performed on the whole tracking information set D:

$$LL(D) = \ln\left( \prod_{j=1}^{m} p_M(d_j) \right) = \sum_{j=1}^{m} \ln\left( \sum_{i=1}^{h} \alpha_i \, p(d_j \mid \mu_i, G_i) \right) \tag{3.2.1}$$

Maximum likelihood estimation maximizes the above formula. At the maximum the derivative is 0, so the partial derivatives of Eq. (3.2.1) with respect to the three parameters are set to 0.

First, taking the partial derivative of Eq. (3.2.1) with respect to $\mu_i$, setting it to 0, and using Eq. (3.1.2) gives

$$\mu_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, d_j}{\sum_{j=1}^{m} \gamma_{ji}} \tag{3.2.2}$$

Similarly, taking the derivative with respect to $G_i$ gives

$$G_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, (d_j - \mu_i)(d_j - \mu_i)^{T}}{\sum_{j=1}^{m} \gamma_{ji}} \tag{3.2.3}$$

Because $\alpha_i$ is the probability that a piece of tracking information selects the i-th target batch, the values $\alpha_1, \alpha_2, \ldots, \alpha_h$ must all be greater than or equal to 0 and sum to 1. Taking the partial derivative of the Lagrangian form of the maximum likelihood estimation with respect to $\alpha_i$ and setting it to 0 gives

$$\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji} \tag{3.2.4}$$


With the update rule for every parameter in hand, the iterative process of Gaussian mixture clustering proceeds as follows. The posterior probability of each Gaussian component is computed from the current parameters and the current tracking information, and the three parameters are then updated with the formulas above. This is repeated until the maximum number of iterations is reached or LL(D) shows little or no improvement, at which point the parameter updates stop and the tracking information contained in every cluster is determined. The marker of each piece of tracking information is its target batch number. The overall flow of target tracking information association is shown in the following figure.
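The parameter update of one iteration can be sketched as follows, assuming the posterior probabilities are already available as an array with one row per piece of tracking information and one column per Gaussian component. The function name `m_step` is hypothetical, and the stopping test on LL(D) is left to the caller.

```python
import numpy as np

def m_step(data, gamma):
    """Update mixing weights, means, and covariances from posteriors.

    data  : array of shape (m, k), one row per piece of tracking information
    gamma : array of shape (m, h), posterior of each component for each row
    """
    totals = gamma.sum(axis=0)                  # sum of posteriors per component
    mus = (gamma.T @ data) / totals[:, None]    # posterior-weighted means
    covs = []
    for i in range(gamma.shape[1]):
        diff = data - mus[i]
        # Posterior-weighted sum of outer products (d_j - mu_i)(d_j - mu_i)^T.
        covs.append((gamma[:, i, None] * diff).T @ diff / totals[i])
    alphas = totals / data.shape[0]             # average posterior per component
    return alphas, mus, np.stack(covs)
```

Alternating this update with a recomputation of the posteriors gives the iterative scheme described above; in practice a small regularization term on the covariances is often added to keep them invertible, which this sketch omits.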
4 Justification and Summary


To demonstrate the advantages of this algorithm, we use a measure that evaluates how well the tracking information is separated, the Davies-Bouldin (DB) index:

$$DBI = \frac{1}{h} \sum_{i=1}^{h} \max_{j \neq i} \left( \frac{avg(C_i) + avg(C_j)}{d_{cen}(\mu_i, \mu_j)} \right)$$

Here $d_{cen}(\cdot, \cdot)$ is the distance between the centers of two clusters, and avg(C) is the average distance between the pieces of tracking information within cluster C. The denominator of the DB index is thus the distance between two tracking information clusters: the larger the denominator, the lower the similarity of the two clusters. The numerator represents how tightly the tracking information is aggregated within the two clusters. So the smaller the value of the DB index, the better the result of the algorithm. On a tracking information data set of 11 ships, the DB index obtained is 0.32 to 0.33, which is better than the general result.
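A minimal sketch of the DB index is given below. It assumes Euclidean distance, takes avg(C) to be the mean pairwise distance inside a cluster, and uses the cluster mean as its center; these choices and the function names are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def avg_intra(cluster):
    """Mean pairwise Euclidean distance inside one cluster (0 for a single point)."""
    if len(cluster) < 2:
        return 0.0
    return float(np.mean([np.linalg.norm(a - b) for a, b in combinations(cluster, 2)]))

def davies_bouldin(clusters):
    """DB index for a list of at least two clusters, each an array of shape (n_i, k)."""
    centers = [c.mean(axis=0) for c in clusters]
    avgs = [avg_intra(c) for c in clusters]
    h = len(clusters)
    total = 0.0
    for i in range(h):
        # Worst-case (largest) ratio of within-cluster spread to center distance.
        total += max(
            (avgs[i] + avgs[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(h) if j != i
        )
    return total / h
```

Compact clusters that lie far apart give a small value, which is why a lower DB index indicates a better partition.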

The algorithm proposed in this article places all the tracking information in the k-dimensional space, iterates in a loop, and determines the set of tracking information detected by multiple sensors that corresponds to each target. It thereby associates target batch numbers precisely and completes the work in quadratic time.
Abstract. A density peaks clustering algorithm based on an improved RNA genetic algorithm (DPC-RNAGA) is proposed in this paper. To overcome the problems of clustering by fast search and find of density peaks (referred to as DPC), DPC-RNAGA uses an exponential method to calculate the local density. In addition, an improved RNA-GA is used to search for the optimal thresholds of local density and distance, so that clustering centers can be determined easily. Numerical experiments on synthetic and real-world datasets show that DPC-RNAGA achieves better or comparable performance on the clustering benchmark, the adjusted Rand index (ARI), compared with the K-means, DPC, and Max_Min_SD methods.


Keywords: RNA genetic algorithm · Clustering · Density peaks · Distance


1 Introduction 


Clustering, a kind of unsupervised learning technique, aims to divide a given population into groups or clusters with common characteristics, so that objects in the same cluster are similar to one another and dissimilar to objects in other clusters. Clustering is widely used in exploratory pattern analysis, grouping, decision making, and machine learning, including data mining, document retrieval, image segmentation, and pattern classification [1]. Many different clustering methods exist, including the K-means clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Affinity Propagation (AP), and so on [2, 7, 9, 10].

Recently, a novel clustering algorithm was proposed by Alex Rodriguez and Alessandro Laio [3], based on the assumption that clustering centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any point with a higher local density. We refer to this algorithm as DPC (Density Peaks Clustering) in this paper. It was demonstrated on several test cases that DPC can efficiently find the cluster centers (i.e., the density peaks), assign the remaining points to their appropriate clusters, and detect outliers. Several studies have built on this method [4]. However, DPC does have some shortcomings. The clustering centers are determined only by the product of the density and the distance, which affects the selection of the best clustering centers. Moreover, the calculation of the density depends heavily on the cutoff distance $d_c$, which limits the generalization of the algorithm.
To overcome these problems, we propose a novel DPC based on the RNA genetic algorithm (DPC-RNAGA). The method completes clustering by density and distance analysis based on RNA-GA: it computes density with an exponential method to reduce the impact of the cutoff distance and adopts RNA-GA to search for the optimal thresholds of the density and distance to determine the best clustering centers. The clustering benchmark, the Rand index, is taken as the fitness function. To keep the search region from drifting and to accelerate convergence, a penalty factor is introduced in the RNA-GA. Meanwhile, the selective crossover operators improve the effectiveness and accuracy of the algorithm.

This paper is organized as follows. Section 2 briefly describes the work related to the improved algorithm, namely density peaks clustering and the improvement of RNA-GA. Section 3 introduces our novel clustering algorithm DPC-RNAGA and gives a detailed analysis. Section 4 tests the proposed algorithm on several real-world and synthetic datasets and compares its performance with K-means, DPC, and Max_Min_SD on the adjusted Rand index (ARI). Section 5 draws some conclusions.


2 Related Works 


The proposed DPC-RNAGA is based on DPC and RNA-GA. This section provides 
brief reviews of DPC and RNA-GA. 


2.1 Density Peaks Clustering 


Rodriguez and Laio proposed an algorithm, published in Science, that has attracted many machine learning researchers. Its idea is that clustering centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. The method uses two important quantities: the local density $\rho_i$ of each point i and its distance $\delta_i$ from points of higher density. In the following, we describe the calculation of $\rho_i$ and $\delta_i$ in detail. Assume that the data set is $X_{N \times M} = [x_1, x_2, \ldots, x_N]^T$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]$ is a vector with M attributes and N is the number of points. The distance matrix of the data set needs to be computed first.

Let $d(x_i, x_j)$ denote the Euclidean distance between the points $x_i$ and $x_j$. The local density of a point $x_i$, denoted by $\rho_i$, is defined as:

$$\rho_i = \sum_{j} \chi\left( d(x_i, x_j) - d_c \right), \qquad \chi(x) = \begin{cases} 1 & \text{if } x < 0 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Here $d_c$ is the cutoff distance, so $\rho_i$ is the number of points adjacent to the point $x_i$; $d_c$ is an adjustable parameter and the only variable in Formula (1). We instead calculate the local density with an exponential kernel function, defined as follows:

$$\rho_i = \sum_{j \neq i} \exp\left( -\left( \frac{d(x_i, x_j)}{d_c} \right)^2 \right) \tag{2}$$

The computation of $\delta_i$ is quite simple. It is the minimum distance between the point $x_i$ and any other point with higher density:

$$\delta_i = \begin{cases} \min_{j: \rho_j > \rho_i} d(x_i, x_j) & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_{j} d(x_i, x_j) & \text{otherwise} \end{cases} \tag{3}$$

Only points with relatively high $\rho_i$ and high $\delta_i$ are considered clustering centers; such points are also called peaks. After the clustering centers have been found, DPC assigns each remaining point to the same cluster as its nearest neighbor with higher density.
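The two quantities can be computed directly from the pairwise distance matrix, as in the following sketch. It assumes the exponential kernel takes the squared ratio of distance to cutoff distance and that the points are rows of a NumPy array; the function name `density_and_delta` is hypothetical.

```python
import numpy as np

def density_and_delta(points, d_c):
    """Return (rho, delta, dist) for points of shape (n, m) and cutoff distance d_c."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # Euclidean distance matrix
    kernel = np.exp(-(dist / d_c) ** 2)
    np.fill_diagonal(kernel, 0.0)                # exclude the j == i term
    rho = kernel.sum(axis=1)
    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = rho > rho[i]
        # Nearest denser point if one exists, otherwise the farthest point overall.
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return rho, delta, dist
```

Building the distance matrix takes quadratic time and memory in the number of points, in line with the complexity discussion later in the paper.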


2.2 The Improvement of RNA Genetic Algorithm 


RNA encoding and decoding. In the RNA-GA, a section of a chromosome encodes one parameter, and a whole chromosome encodes all the parameters. Specifically, a chromosome represents the values of $\rho_i$ and $\delta_i$ in this study.


Genetic operators and penalty factor. The tournament method is adopted as the selection method in DPC-RNAGA [5]. However, after several experiments we found that in some circumstances the search field becomes biased. We therefore propose the penalty factor d to discard inferior individuals, namely those whose fitness values fall in the last d percent of the population. If the initial population is $Q_1 = \{q_1, q_2, \ldots, q_m\}$ and the remaining ratio of the parent generation is r, the number of individuals remaining from the parents is r × m.

A selective crossover operator is designed in this paper to enhance the ability to explore the search space. According to the crossover probability $p_c$, the two parents exchange the bases between the two crossover points to produce two offspring individuals.

We adopt the ordinary mutation operators but with adaptive probabilities. We divide the nucleotide bases into two parts, left and right. Correspondingly, the adaptive probabilities pml (left) and pmh (right) are:

$$p_{ml} = a_1 + \frac{b_1}{1 + \exp[aa(g - g_0)]} \tag{4}$$

$$p_{mh} = a_1 + \frac{b_1}{1 + \exp[-aa(g - g_0)]}$$

where $a_1$ denotes the initial mutation probability of pml and $b_1$ denotes the range of transmutability. The parameter g is the evolutionary generation, $g_0$ denotes the generation at which the maximum mutation probability is reached, and aa is the speed of change.

Fitness function. In this paper, the Rand index is adopted as the fitness function of RNA-GA. The Rand index is defined as the ratio of the number of data pairs on which two clustering results agree to the total number of data pairs [6]. Its value ranges from 0 to 1, and a value close to 1 indicates a better clustering.
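The definition can be made concrete with a short sketch. A pair of points counts as an agreement when both clusterings put the two points in the same cluster, or both put them in different clusters; the function name `rand_index` is hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two points are required")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The index depends only on which points are grouped together, not on the label values, so relabeling the clusters leaves it unchanged.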


3 The Density Peaks Clustering Algorithm Based 
on Improved RNA-GA 


The density peaks clustering algorithm based on improved RNA-GA adaptively selects the best clustering centers without human intervention and reduces the impact of the cutoff distance on the calculation of local density. First, ρ and δ are calculated for each point, and m pairs of parameters are generated randomly to form the initial population; executing RNA-GA then yields two thresholds. Second, the clustering centers are determined according to the thresholds. Third, clustering is completed by assigning the remaining data points: each point is assigned to the same cluster as the nearest center with higher local density. The fitness value is calculated when the clustering process is finished. The algorithm stops when the fitness value is stable or the maximum number of iterations is reached.
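The second and third steps can be sketched for one candidate pair of thresholds. The sketch assumes ρ, δ, and the distance matrix are already computed, and it falls back to the nearest center of any density when no center is denser than the point; that fallback and the names `select_and_assign`, `rho_min`, and `delta_min` are assumptions of this sketch.

```python
import numpy as np

def select_and_assign(rho, delta, dist, rho_min, delta_min):
    """Pick centers above both thresholds, then label every remaining point."""
    n = len(rho)
    centers = [i for i in range(n) if rho[i] > rho_min and delta[i] > delta_min]
    if not centers:
        raise ValueError("no point exceeds both thresholds")
    labels = np.full(n, -1)
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    for i in range(n):
        if labels[i] != -1:
            continue
        # Centers with higher local density; assumed fallback: all centers.
        candidates = [c for c in centers if rho[c] > rho[i]] or centers
        nearest = min(candidates, key=lambda c: dist[i, c])
        labels[i] = labels[nearest]
    return centers, labels
```

Within the genetic search, each chromosome would supply one such threshold pair, and the Rand index of the resulting labels would serve as its fitness value.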

Let the size of the dataset be n. The iterations of RNA-GA cost the most time. The complexity of each iteration is O(n²) and the number of iterations is $G_{max}$, so the time complexity of the algorithm is O(m $G_{max}$ n²). The space complexity of the algorithm is due to the storage of the density and distance metrics, so it is O(n²). A comparison with the complexity of several typical clustering algorithms is shown in Table 1.


Table 1. The comparative complexity analysis of algorithms

Algorithm | Time complexity | Space complexity
DPC-RNAGA | O(m G_max n²) | O(n²)
DPC | O(kn²) | O(n²)
K-means | O(ktn) | O(n)
Max_Min_SD | O(n²) | O(n²)


From Table 1, we can see that the time complexity of DPC-RNAGA is of the same 
order as DPC. Although it is higher than K-means, it is still acceptable. And it doesn’t 
affect the application of the algorithm. 
4 Experiments


To evaluate the proposed algorithm, clusters created by DPC-RNAGA are compared with those of state-of-the-art clustering algorithms on synthetic and real-world datasets. The synthetic datasets were collected from http://cs.uef.fi/sipu/datasets/ [8]. The details of the datasets are shown in Tables 2 and 3, respectively.























Table 2. Synthetic datasets

Dataset | Classes
Aggregation | 7
Spiral | 3
D31 | 31
Flame | 2

Table 3. Real-world datasets

Dataset | Classes
Iris | 3
Wine | 3
Glass | 6
Seeds | 3

To evaluate the performance of DPC-RNAGA, we used eight datasets and compared it with K-means, DPC, and Max_Min_SD. The synthetic datasets have low dimensionality and clearly shaped clusters, so we set the population size to 50, the maximum number of genetic iterations to 30, and the penalty factor to 5. In the real-world datasets, however, the clusters mostly overlap and there is much noise, so we set the population size to 100, the maximum number of genetic iterations to 30, and the penalty factor to 1. The parameters of the other three algorithms are set according to the actual clusters.

In the proposed algorithm, RI is used as the fitness value. Table 4 shows the RI obtained by the four clustering algorithms on the eight datasets. The closer the RI is to 1, the better the clustering result. The underlined RI is the best RI for the given dataset. The experimental results show that the proposed algorithm performs better than the other three methods.

The results show that all four clustering methods perform well when the dataset is small and contains few noise points. When noise points appear, however, the proposed algorithm performs better than DPC. Moreover, because real-world datasets have more noise and overlap and are larger, the proposed algorithm obtains better clustering results on them than the other three algorithms.


Table 4. Comparison of RI for the four clustering algorithms

Dataset | RNA-GA-DPC DPC | Max_Min_SD
Aggregation | 0.9993 0.9434
Spiral 1.0000 | 0.5813
D31 0.9934
Flame 0.4829
Iris 0.8923 | 0.7931
Wine 0.4341
Glass 0.4621
Seeds 0.7049
5 Conclusion


A clustering algorithm based on an RNA genetic algorithm and density peak finding was
proposed in this paper. The method computes the local density of data points with an
exponential kernel, which reduces the dependence on the parameter $d_c$.
RNA-GA is used to search for the optimal density and the optimal distance from other data
points with higher density, which makes the clustering algorithm more effective. A penalty
factor is added to the RNA-GA, which keeps the search from straying into the wrong region
and accelerates convergence. The power of DPC-RNAGA was tested on several synthetic
and real-world datasets. The results demonstrate that DPC-RNAGA is powerful
in finding cluster centers and recognizing clusters regardless of their shape and
size. The comparative analysis of several clustering algorithms shows that our proposed
method fits many different datasets and performs better than the existing DPC,
K-means and Max_Min_SD methods.
Abstract. Homogeneous Spiking Neural P Systems (HSN P systems, for short)
are a class of neural-like computing models in membrane computing, inspired by the
observation that neurons are "designed" by nature to have the same "set of
rules", "working" in a uniform way to transform input into output. HSN P systems
can be converted to weighted homogeneous SN P systems. In this work, based on
these two known systems, we consider a restricted variant of SN P systems
called local homogeneous weighted SN P systems (LHWSN P systems, for short),
where neurons in the same module have the same set of rules. We prove
that such systems are Turing complete. Specifically, it is proved that
using only standard spiking rules is sufficient to compute and accept the family
of sets of Turing computable natural numbers; moreover, local homogeneity
reduces the time required to execute the system.


Keywords: Spiking Neural P system · Weight · Local homogeneous


1 Introduction 


Membrane computing [1] is one of the recent branches of natural computing, whose aim 
is to construct powerful computing models and intelligent algorithms by abstracting 
ideas from a single living cell and from complexes of cells. The obtained models are 
distributed and parallel computing devices, usually called P systems. There are three 
main classes of P systems investigated: cell-like P systems [1], tissue-like P systems [2] 
and neural-like P systems [3]. Spiking Neural P systems (SN P systems, for short) are
a class of neural-like P systems, inspired by the way biological neurons in the
human brain process information and communicate with each other by means of
electrical impulses (spikes) [3].

In the research on SN P systems, the results mainly focus on three aspects: theory,
application and implementation. The theoretical work is divided into three aspects: generating
sets of numbers [4-10], generating languages [11, 12], and computing Turing computable
functions [13, 14]. In recent years, we have studied a variety of SN P system variants and
analyzed their ability to generate sets of numbers [4, 6-8, 15, 16]. In particular, Zeng
proposed homogeneous SN P systems in 2009 [4] and proved that homogeneous
SN P systems are Turing universal in both the generating and accepting modes. In 2015, Song
proposed homogeneous SN P systems with inhibitory synapses (through inhibitory
synapses, spikes are converted to anti-spikes) [16], and proved that these systems are
Turing universal when using only spiking rules.

Inspired by the various SN P systems, this article proposes a variant of SN P systems based on
a biological fact and studies its ability to generate sets of numbers in two modes;
in addition, we analyze the time needed to execute the system. The
biological fact is that although neurons are "designed" to be very similar, there are still
some differences between neurons. According to their function, neurons
can be divided into three categories: sensory neurons, connecting neurons, and motor
neurons. In a nerve center, neuronal cell bodies with the same function gather
together to regulate a physiological activity of the body (the definition of a nerve center). That
is to say, the rules of neurons differ somewhat, but neurons with the
same rules gather together and cooperate to complete a certain
physiological function. We introduce this biological phenomenon into SN P systems,
so that the resulting SN P systems have local homogeneity.

In LHWSN P systems, the rules used by neurons in the same functional module are
identical, and the weights on the synapses can be positive or negative integers.
When a spike passes through a synapse with negative weight, the spike is changed
into an anti-spike, and the number of anti-spikes delivered equals the absolute value
of the weight.

In this paper, we prove that LHWSN P systems are Turing universal. First,
we show that the delay of rules can be removed while the systems remain Turing
universal. Second, we use negative weights to express inhibitory synapses, so the
forgetting rules in neurons are removed. Third, the proposed local homogeneity
reduces the total time needed to execute the system.

In the next section, we introduce prerequisites and the definition of LHWSN P
systems. In Sect. 3, we investigate the computational power of the system. Conclusions
and remarks are given in Sect. 4.


2 Local Homogeneous Weighted SN P Systems 


Before we introduce LHWSN P systems, we recall some necessary prerequisites. It is 
useful for readers to have some familiarity with basic elements of formal language 
theory, e.g., from [17], as well as basic concepts and notions of SN P system and register 
machine [18]. 


2.1 Background 


Generally, an SN P system is composed of neurons, spikes, synapses, and rules (spiking 
rules and forgetting rules). A neuron can send information to its neighboring neurons 
by using the spiking rule. A predefined number of spikes will be removed from the 
neuron by using the forgetting rule, and thus they are removed out of the system. One 
neuron in the system is specified as the output neuron, which can emit spikes to the 
environment. 
In the study of the computing power of SN P systems, most results are proved by
simulating the computation of a register machine. A register machine is a construct
$M = (m, H, l_0, l_h, R)$, where $m$ is the number of registers, $H$ is the set of instruction
labels, $l_0$ is the start label, $l_h$ is the halt label, and $R$ is the set of instructions. The
instructions are of the following forms:

• $l_i$: (ADD($r$), $l_j$, $l_k$) (add 1 to register $r$ and then go to one of the instructions with
labels $l_j$ and $l_k$, non-deterministically chosen),

• $l_i$: (SUB($r$), $l_j$, $l_k$) (if register $r$ is non-empty, then subtract 1 from it and go to the
instruction with label $l_j$; otherwise go to the instruction with label $l_k$),

• $l_h$: HALT (halt the computation of the register machine and regard the number in register
1 as the result of the register machine).


A register machine has two working modes: the generating mode and the accepting
mode. In the generating mode, the register machine computes a number $n$ as follows: starting
with all registers empty, the initial instruction with label $l_0$ is applied, and the instructions
are then applied as indicated by the labels. When the register machine M reaches the
halt instruction, the number $n$ stored at that time in the first register is said to be computed
by M. In the accepting mode (using deterministic ADD instructions), an arbitrary number is
stored in the first register (the other registers are empty). If the computation starting from
this initial configuration eventually halts, the number is said to be accepted by M.
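The two working modes can be made concrete with a small interpreter. The following is a minimal sketch, not part of the original construction: the tuple encoding of instructions, the label strings, the two example programs and the `max_steps` cut-off (used because non-halting cannot be detected in general) are all assumptions made for illustration.

```python
import random


def run_register_machine(program, start, num_registers, initial=None,
                         max_steps=10_000, rng=None):
    """Run a register machine; return the registers on HALT, or None if the
    step budget runs out first."""
    rng = rng or random.Random()
    regs = {r: 0 for r in range(1, num_registers + 1)}
    regs.update(initial or {})
    label = start
    for _ in range(max_steps):
        instr = program[label]
        if instr[0] == "HALT":
            return regs
        op, r, l_j, l_k = instr
        if op == "ADD":
            # Add 1 to register r, then jump non-deterministically.
            regs[r] += 1
            label = rng.choice((l_j, l_k))
        elif op == "SUB":
            # Subtract 1 if possible and go to l_j, otherwise go to l_k.
            if regs[r] > 0:
                regs[r] -= 1
                label = l_j
            else:
                label = l_k
        else:
            raise ValueError(f"unknown instruction: {op}")
    return None


# Generating mode (hypothetical program): produces some n >= 1 in register 1.
GENERATE_POSITIVE = {"l0": ("ADD", 1, "l0", "lh"), "lh": ("HALT",)}

# Accepting mode (hypothetical program): halts exactly on even inputs.
# The ADD instruction is deterministic because both target labels coincide.
ACCEPT_EVEN = {
    "l0": ("SUB", 1, "l1", "lh"),
    "l1": ("SUB", 1, "l0", "loop"),
    "loop": ("ADD", 2, "loop", "loop"),
    "lh": ("HALT",),
}


def accepts(program, start, num_registers, n, max_steps=10_000):
    """Accepting mode: store n in register 1 and report whether M halts."""
    result = run_register_machine(program, start, num_registers,
                                  initial={1: n}, max_steps=max_steps)
    return result is not None
```

In the accepting example, an odd input drives the machine into the `loop` label, which never reaches HALT, so the input is not accepted within the step budget.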


2.2 System Description 


In this section, we define the LHWSN P system. The definition is complete, but
familiarity with the basic elements of homogeneous SN P systems, anti-spikes and SN P
systems with weights (e.g. from [4, 7, 19]) is helpful.

An LHWSN P system of degree $m \ge 1$ is a construct of the form

$\Pi = (O, \sigma_1, \sigma_2, \ldots, \sigma_m, syn, in, out)$,

where

• $O = \{a, \bar{a}\}$ is the alphabet, where $a$ is the spike and $\bar{a}$ is the anti-spike;

• $\sigma_1, \sigma_2, \ldots, \sigma_m$ are neurons of the form $\sigma_i = (n_i, R)$ with $1 \le i \le m$, where
(a) $n_i$ is a natural number representing the initial number of spikes in $\sigma_i$;
(b) $R$ is the set of rules in each neuron, of the following forms:

1. $E/a^c \to a$ is the spiking rule, where $E$ is a regular expression over $\{a\}$
and $c$ is an integer with $c \ge 1$;

2. $a^s \to \lambda$ is the forgetting rule, with the restriction that for any $s \ge 1$ and any
spiking rule $E/a^c \to a$, $a^s \notin L(E)$, where $L(E)$ is the regular language
associated with the regular expression $E$ and $\lambda$ is the empty string;

3. $a\bar{a} \to \lambda$ (a spike and an anti-spike annihilate each other).

• $syn \subseteq \{1, 2, \ldots, m\} \times \{1, 2, \ldots, m\} \times \mathbb{Z}$ ($\mathbb{Z}$ is the set of integers) is the set of synapses
between neurons, where $i \ne j$ and $z \ne 0$ for each $(i, j, z) \in syn$, and for each
$(i, j) \in \{1, 2, \ldots, m\} \times \{1, 2, \ldots, m\}$ there is at most one synapse $(i, j, z)$ in $syn$;

• $in, out \in \{1, 2, \ldots, m\}$ indicate the input and output neurons, respectively.


In LHWSN P systems, the rules used by neurons in the same functional module are
identical, but the rules of different functional modules are different. A neuron may
contain a number of spikes or anti-spikes, but only one of the two kinds. In LHWSN P
systems, spiking rules ($E/a^c \to a$) can be applied in any neuron as follows: if neuron
$\sigma_i$ contains $k$ spikes with $a^k \in L(E)$ and $k \ge c$, the spiking rule $E/a^c \to a$ is enabled.
By using the rule, $c$ spikes are consumed, so $k - c$ spikes remain in neuron $\sigma_i$,
and one spike is sent to all neurons $\sigma_j$ such that $(i, j, z) \in syn$. Rules of the form
$a^s \to \lambda$, $s \ge 1$, are forgetting rules with the restriction $a^s \notin L(E)$ (that is to say, a neuron
cannot apply spiking rules and forgetting rules at the same moment), where $L(E)$ is
the regular language associated with the regular expression $E$ and $\lambda$ is the empty string.
If neuron $\sigma_i$ contains exactly $s$ spikes, the forgetting rule $a^s \to \lambda$ can be applied, by which
$s$ spikes are removed from the neuron.
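The enabling condition for a spiking rule can be checked mechanically. The sketch below is an illustration only: it assumes the regular expression E is written in Python `re` syntax over the single letter `a`, and the helper names are hypothetical. A rule written as $a^c \to a$ is treated, following the usual SN P convention, as $E/a^c \to a$ with $L(E) = \{a^c\}$.

```python
import re


def spiking_rule_enabled(k, regex, c):
    """True if a neuron holding k spikes may apply the rule E/a^c -> a.

    The rule is enabled when a^k belongs to L(E) and k >= c.
    """
    return k >= c and re.fullmatch(regex, "a" * k) is not None


def apply_spiking_rule(k, regex, c):
    """Consume c spikes; return (spikes remaining, spikes emitted)."""
    if not spiking_rule_enabled(k, regex, c):
        raise ValueError("rule is not enabled for this spike count")
    return k - c, 1
```

For example, a rule of the form $a^4(a^5)^+/a^4 \to a$ corresponds to `regex="a{4}(a{5})+"` and `c=4`: it is enabled for 9 spikes, leaving 5, but not for 10 spikes.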

Each synapse is denoted by a triple $(i, j, z)$. Specifically, when neuron $\sigma_i$ spikes at a
certain step, sending a spike to neuron $\sigma_j$ along the synapse $(i, j, z)$: if $z > 0$, neuron
$\sigma_j$ receives $z$ spikes; if $z < 0$, neuron $\sigma_j$ receives $-z$ anti-spikes. The result of a
computation of the system $\Pi$ is defined as the time interval between the first two spikes
emitted to the environment by the output neuron at steps $t_1$, $t_2$, that is, the number
$t_2 - t_1$; we say that this number is generated by $\Pi$. The set of all numbers computed in
this way by $\Pi$ is denoted by $N_2(\Pi)$ (the subscript 2 indicates that we only consider the
distance between the first two spikes of any computation).
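The effect of weighted synapses can be sketched in a few lines. This is a simplified illustration under stated assumptions: the content of each neuron is stored as one signed integer (positive for spikes, negative for anti-spikes), so that annihilation of spike/anti-spike pairs is plain integer addition; the function names and the example data are hypothetical.

```python
def deliver_spike(contents, synapses, i):
    """Neuron i emits one spike; return the updated neuron contents.

    contents maps neuron index -> signed count (spikes > 0, anti-spikes < 0).
    synapses is a list of triples (i, j, z): a weight z > 0 delivers z spikes,
    a weight z < 0 delivers -z anti-spikes. Spikes consumed by the firing
    rule in neuron i are not handled here.
    """
    updated = dict(contents)
    for src, dst, z in synapses:
        if src == i:
            updated[dst] = updated.get(dst, 0) + z
    return updated


def generated_number(output_spike_steps):
    """Result of a computation: distance between the first two output spikes."""
    t1, t2 = output_spike_steps[0], output_spike_steps[1]
    return t2 - t1
```

With `synapses = [(1, 2, -1), (1, 3, 3)]`, a spike from neuron 1 removes one spike from a neuron 2 that holds spikes, or leaves it with one anti-spike if it was empty, and adds three spikes to neuron 3.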


An LHWSN P system $\Pi$ can also work in the accepting mode. A number $n$ is introduced
in a specified neuron in the form of $f(n)$ spikes by reading a spike train from the
environment through the input neuron. If the computation eventually halts, then the number
$n$ is said to be accepted by $\Pi$. The set of numbers accepted by $\Pi$ is denoted by $N_{acc}(\Pi)$
(the subscript acc indicates that the system works in the accepting mode).

We denote by $N_{\alpha}LHW_{k}SNP_{m}$, $\alpha \in \{2, acc\}$, the family of all sets of numbers generated
or accepted by LHWSN P systems of degree $m$ in which the absolute value of each weight is
not more than $k$. If the parameter $m$ is not bounded, then it is replaced with $*$.


3 The Computing Power of LHWSN P Systems 


3.1 Proof of Turing Universality


In this section, we prove that LHWSN P systems are computationally universal in
both the generating mode and the accepting mode. In the following proofs, a directed graph
is used to represent the structure of the system: the neurons are placed in the nodes of
the graph and connected with each other by the synapses; the output neuron has an outgoing
synapse, emitting spikes to the environment.

Theorem 1. $N_{2}LHW_{k}SNP_{*} = NRE$

It is enough to prove the inclusion $NRE \subseteq N_{2}LHW_{k}SNP_{*}$, since the converse inclusion
follows from the Turing-Church thesis. In the following universality proof, an
LHWSN P system $\Pi$ is constructed to simulate the computation of a register machine M.

Each register $r$ of M is represented by a neuron $\sigma_r$, whose content corresponds to the
number stored in register $r$. Specifically, if the number stored in register $r$ is $n \ge 0$, then
neuron $\sigma_r$ contains $5n$ spikes. In the initial configuration, the number stored in each
register of M is 0, so there are no spikes in the neurons representing the registers. One
neuron $\sigma_{l_i}$ is associated with each instruction labeled $l_i$ of M. During a computation, a
neuron $\sigma_{l_i}$ fires and starts to simulate an instruction $l_i$: (ADD($r$), $l_j$, $l_k$) or
$l_i$: (SUB($r$), $l_j$, $l_k$): starting with neuron $\sigma_{l_i}$ activated, the system operates on neuron
$\sigma_r$ as requested by ADD or SUB, and then introduces spikes into neuron $\sigma_{l_j}$ or neuron
$\sigma_{l_k}$, which becomes active in this way. When neuron $\sigma_{l_h}$ (associated with the label $l_h$
of the halting instruction of M) is activated, a computation in M has been completely
simulated in $\Pi$; the FIN module then starts to output the computation result.


The LHWSN P system $\Pi$ is composed of three types of modules: ADD modules,
SUB modules, and a FIN module, which are used to simulate the ADD, SUB, and HALT
instructions of M, respectively.

Module SUB is shown in Fig. 1. The rules for the neurons of the SUB module are
$a \to a$ and $a^4(a^5)^+/a^4 \to a$.

Fig. 1. The structure of SUB module


Assume at step t, the system starts to simulate a SUB instruction l; of M. Having one 
spike inside, neuron o, fires at step t emitting a spike to neurons Oj: and o,. The spike 
arriving at neuron o, will be changed into an anti-spike due to the negative weight, and 
there are the following two cases in neuron o,. 
— Before receiving the anti-spike, if the number stored in register r is not zero, then 
neuron o, contains at least 5n spikes. After the annihilation, the number of spikes in 
neuron 6, is decremented to 5(n — 1) + 4, which simulates decreasing the number in 
register r by one. With 5(n — 1) + 4 spikes inside, neuron o, fires at step t + 1 by using 


the rule af (a5)" /a* — a and sends one spike to neuron Op, Op- The spike arriving at 
neuron Oop will be changed into an anti-spike due to the negative weight. Meanwhile, 
neuron Op receives one spike from neurons op. The anti-spike in neuron op will be 
annihilated by the spike from neuron op. So neuron Op, ends with no spike or anti- 
spike inside and remains inactive. By receiving the spike, neuron Op fires at step 
t+ 2, and the neuron o will be fired at step t + 3 to start to simulate instruction |; of 
M. It is worth to point out that 5 spikes are sent back to neuron o, at step t + 3 from 
neuron Op, which represents that the number in register r is still n. 

— Before receiving the anti-spike, if the number in register r is zero, then neuron o, has 
no spikes. After receiving the anti-spike, neuron o, has one anti-spike. At step t + 1, 
neuron oy fires sending one spike to neuron Op. At step t + 2, neuron Op sends a spike 
to neuron o, and neuron o,. In neuron o,, the anti- spike is annihilated, and it keeps 
inactive. At step t + 3, neuron o, receives a spike from neuron op and becomes active, 
which starts to simulate instruction l, of M. 


The simulation of the SUB instruction is correct: system $\Pi$ starts from neuron $\sigma_{l_i}$ and
ends in neuron $\sigma_{l_j}$ (if the number stored in register $r$ is greater than 0 and is decreased
by one), or in neuron $\sigma_{l_k}$ (if the number stored in register $r$ is 0).


Module ADD, shown in Fig. 2, simulates the ADD instruction $l_i$: (ADD($r$), $l_j$, $l_k$).
The neurons of the ADD module are identical, and the rules contained in the neurons are
$a \to a$, $a^2 \to a$, $a^3 \to a$, and $a^3/a^2 \to a$.

Fig. 2. The structure of ADD module
The initial instruction $l_0$ of M is an ADD instruction. Let us assume that at step t an
instruction $l_i$: (ADD($r$), $l_j$, $l_k$) has to be simulated, with one spike in neuron $\sigma_{l_i}$ and no
spike in any other neuron, except in those neurons associated with registers. Having
one spike inside, neuron 0, fires at step t sending one spike to neuron 0}, op and o,, and 
three spikes to op respectively. Neuron o, sends one spike to neuron o,, which simulates 
increasing the number in register r by one. With three spikes inside, neuron op non- 
deterministically chooses one of the two spiking rules a’ > a and a? /a? —> ato apply at 
step t+ 1, which will non-deterministically activate neuron ©, or Oj; 


— In neuron o,, if rule a? — ais chosen to be used, then it sends one spike to neurons 
©; and Op, but when the spike arrives at neuron ox, it will be changed into an anti- 
spike due to the negative weight between neurons Op and op. Neuron o; receives an 
anti-spike from neuron 6, and a spike from neuron Op. So neuron Oj ends with no 
spike or anti-spike inside and remains inactive. At step t+2, neuron Op receives a 
spike from neuron c, and neuron op has an anti-spike. At stept + 3, neuron 6, receives 
two spikes from neuron Op. In neuron 6, the anti-spike inside will be annihilated by 
the spikes from neuron oy. With one spike inside, neuron 6, fires at step t + 3 by using 
the rule a > a and sends one spike to neuron o,. By receiving the spike, neuron ©, 
fires at step t + 4, simulating instruction l, of M. 

— In neuron oy, if rule a’ /a* —> ais chosen to use, then after the rule is applied, neuron 
Op can spike for the second times by using rule a — a. It indicates that neuron o; 
receives spike and 6, receives anti-spike in t + 2 and t + 3 two steps. The two anti- 
spikes in neuron oy can be annihilated by the two spikes from neuron o, arriving at 
step t+ 3.With no spike or anti-spike inside, neuron o, keeps inactive. In neuron 
6, after annihilating the anti-spike inside, it has one spike and fires at step t + 3, 
sending one spike to neuron Ol; Neuron o fires at step t + 4 to start to simulate 


instruction |, of M. 
Therefore, from firing neuron $\sigma_{l_i}$, the system adds one spike to neuron $\sigma_r$ and
non-deterministically fires one of the neurons $\sigma_{l_j}$ and $\sigma_{l_k}$, which correctly simulates
the ADD instruction $l_i$: (ADD($r$), $l_j$, $l_k$).


Module FIN, shown in Fig. 3, outputs the result of the computation. The FIN module is
composed of neurons of the form shown in Fig. 4.

Fig. 3. The structure of FIN module


a? >a 
a(a?)t/a? >a 


a? (a°)t/aè >a 





Fig. 4. The neuron of FIN module 


Assume now that the computation in M halts (that is, the halting instruction is
reached), which means that neuron $\sigma_{l_h}$ in $\Pi$ has two spikes and fires. The evolution of
the numbers of spikes in the neurons during the computation of the FIN module is shown
in Table 1.


Table 1. The numbers of spikes in neurons of FIN module during the process outputting the 
computational result. 


Neuron | Step 


t hi haz Mes Je fes [tent [t+ nto |{t tnt 
ao |i jo jo jo f+ jo Jo jo [o 
a  |5n|5n+2 |5@-D+2|50-d)+1|- [s+1 [1 f1 fo 
oa o fo o i |0 fo fo fo 
oœ oo BR BR k p P f [o 
E jo ff 2 
ow lo jo jo bR I- jo jo fo 2 


According to the function of the above three modules, it is clear that the register
machine M can be correctly simulated by the LHWSN P system $\Pi$. Therefore, LHWSN P
systems can characterize NRE. It is also worth mentioning that only standard
spiking rules are used, and no forgetting rule is used. Moreover, the maximum
number of rules in a neuron is 4 (in the ADD module), so we greatly reduce the number of
rules in the neurons.
Theorem 2. $N_{acc}LHW_{k}SNP_{*} = NRE$

LHWSN P systems working in the accepting mode can accept any set of Turing computable
natural numbers. This result is achieved by simulating register machines working
in the accepting mode. In the simulation, the FIN module is no longer necessary, but an
INPUT module is needed to receive the spike train $10^{n-1}1$ from the environment. The
number $n$ of steps elapsed between the two spikes is the number to be checked for acceptance.
After the two spikes are received, if the system finally halts, then the number $n$ is said to be
accepted.


Module INPUT is shown in Fig. 5. An LHWSN P system $\Pi'$ working in the
accepting mode is constructed to simulate M. Each register $r$ of M is associated with a
neuron $\sigma_r$ in $\Pi'$; the rules in the neurons are $a \to a$ and $a^4(a^5)^+/a^4 \to a$. The evolution of the
numbers of spikes in the neurons during the computation of the INPUT module is shown
in Table 2.





Fig. 5. The structure of INPUT module 


Table 2. The numbers of spikes in the neurons of INPUT module during the process of reading
spike train $10^{n-1}1$


Neuron (Step | | 
to ted Mea [ees | [ttm eea i te [t+ 43 
o, [14 10 [sis fs E e 


ao do o |s Sx2 |+  |5m@-D]|5n Sn 
During the process of reading the spike train $10^{n-1}1$, the evolution of the numbers of
spikes in the neurons of the INPUT module is shown in Table 2.
In the accepting mode, the ADD module is deterministic and has the form
$l_i$: (ADD($r$), $l_j$), as indicated in Fig. 6. The neurons of this module contain the rule
$a \to a$.





Fig. 6. The structure of deterministic ADD module 


Module SUB remains unchanged (as shown in Fig. 1). Module FIN is removed, with
neuron $\sigma_{l_h}$ remaining in the system, but without outgoing synapses. When neuron $\sigma_{l_h}$
receives two spikes, the computation of register machine M has reached
instruction $l_h$ and stops. Having two spikes inside, neuron $\sigma_{l_h}$ spikes ($a^2 \to a$ is enabled)
without emitting any spike out, and the system $\Pi'$ halts.
As explained above, any set of natural numbers accepted by deterministic
register machines can be accepted by an LHWSN P system $\Pi'$.


3.2 Analysis of Results 


In [20] it is pointed out that spiking rules in neurons use regular expressions as the
firing criterion, and that determining whether a regular expression is satisfied may be an
NP-hard problem. This paper reduces the number of rules in neurons, which greatly reduces
the time for determining whether a regular expression is satisfied and hence the running
time of the whole module.

We assume that the time to judge whether a spiking rule is satisfied is T1, and the time
to judge whether a forgetting rule is satisfied is T2. Each step of the system needs to
determine whether the spiking and forgetting rules can be executed. In the following, we
compare the time required by LHWSNPS and HWSNPS [4] in each module.

In the ADD module of LHWSNPS, each neuron has 4 spiking rules, and the ADD
module takes 5 steps from start to finish, so the total time required for the ADD module
is 4 × 5 × T1 = 20T1. In the ADD module of HWSNPS, each neuron has five spiking rules
and two forgetting rules, and four steps are needed to finish the ADD module, so the total
time is 5 × 4 × T1 + 2 × 4 × T2 = 20T1 + 8T2. Similarly, we can calculate the time consumed
by the two systems in the SUB module, FIN module and INPUT module, respectively. A
comparison of the execution times of the two systems can be seen in Table 3.
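The cost model used in this comparison is simple enough to write down directly. The following sketch assumes, as in the ADD-module calculation, that every step checks every rule of a neuron once, so the cost is (rules per neuron) × (steps); the function names are hypothetical.

```python
def module_time(spiking_rules, forgetting_rules, steps):
    """Return the coefficients (of T1, of T2) for one module.

    Assumes each of the `steps` steps checks all spiking rules (cost T1 each)
    and all forgetting rules (cost T2 each) of a neuron once.
    """
    return spiking_rules * steps, forgetting_rules * steps


def format_time(coefficients):
    """Render coefficients such as (20, 8) as '20T1+8T2'."""
    t1, t2 = coefficients
    parts = []
    if t1:
        parts.append(f"{t1}T1")
    if t2:
        parts.append(f"{t2}T2")
    return "+".join(parts) if parts else "0"
```

With the figures quoted for the ADD module, `module_time(4, 0, 5)` gives the LHWSNPS cost and `module_time(5, 2, 4)` the HWSNPS cost.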
Table 3. Time comparison of the two systems

System | ADD module | SUB module | FIN module | INPUT module
LHWSNP | 20T1 | 8T1 | 3(n+3)T1 | 2(n+2)T1
HWSNP | 20T1+8T2 | 20T1+8T2 | 5(n+3)T1+2(n+3)T2 | 5(n+3)T1+2(n+3)T2





4 Final Remarks 


In this work, inspired by the local homogeneity of biological neurons, we consider a
restricted variant of SN P systems (LHWSN P systems). In such a system, each neuron in
the same module has the same set of rules, that is, the system has local homogeneity. We
investigate the computing power of LHWSN P systems and obtain that such systems,
using only standard spiking rules, are Turing complete. This shows that
we not only maintain the homogeneity of the system, but also reduce the time the system
needs.

Many problems concerning LHWSN P systems need further research. It is worth considering
whether the categories of rules in the system can be reduced without losing the
universality of LHWSN P systems. The languages generated by such systems are also worth
studying. Moreover, as a next step, we will consider using this SN P system to solve
practical problems.
Abstract. Many consensus clustering methods combine all the basic partitionings
(BPs) with the same weight, without considering their contribution to the
consensus result. We use Normalized Mutual Information (NMI) to
design weights for the BPs that participate in the integration, which highlights the
contribution of the most diverse BPs. Then the efficient K-means approach is used
for consensus clustering, which effectively improves the efficiency of combinatorics
learning. An experiment on the UCI dataset iris demonstrates the effectiveness of the
proposed algorithm in terms of clustering quality.


Keywords: Consensus clustering · K-means · Basic partitionings


1 Introduction 


No single clustering algorithm performs best for all data sets [1] or can
discover all types of cluster shapes and structures [2]. Consensus clustering approaches
are able to integrate multiple clustering solutions obtained from different data sources
into a unified solution, and provide a more robust, stable and accurate final result [3].
However, previous research still has some limitations.

First, high-quality BPs benefit the performance of consensus clustering,
whereas partitions of poor quality lead to a worse consensus result. Yet most studies
integrate all BPs without filtering out the poor ones. Second, the diversity among
BPs also has a great impact on consensus clustering: BPs that share different amounts of
mutual information with the other BPs contribute differently to
the consensus result. However, few studies explore the impact of the number
of BPs on consensus clustering, and few take the diversity of BPs into account
in the integration process.

We propose weight-improved K-means-based consensus clustering (WIKCC).
First, we design a weight for each BP participating in the integration. Specifically,
we generate multiple BPs, measure the quality of each BP using the normalized Rand
index $R_n$ [6], and sort the BPs in increasing order of $R_n$. We then explore the
influence of the number of BPs on consensus clustering; based on this exploration
we can choose an appropriate number of better BPs for consensus clustering,
which minimizes the number of BPs while assuring quality. After that, we construct
the co-occurrence matrix with the selected BPs, and calculate the similarity of two
BPs with the Normalized Mutual Information (NMI) method [4] according to the
co-occurrence matrix. The weight of each BP is then designed according to the NMI values,
which reflect the similarity of a single BP to all the BPs. The K-means-based method [2] has
attracted special attention for its simplicity and high efficiency, so we transform consensus
clustering into K-means clustering.


2 Weight Design Based on the Normalized Mutual Information 


Mutual information measures the information shared by two distributions.
We compute the NMI between two BPs; a greater NMI value means a smaller
difference between them, which results in a lower weight.


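Given two BPs, $\pi_i$ with $K_i$ clusters, $\pi_i = \{C_1^{(i)}, C_2^{(i)}, \ldots, C_{K_i}^{(i)}\}$, and $\pi_j$ with $K_j$
clusters, $\pi_j = \{C_1^{(j)}, C_2^{(j)}, \ldots, C_{K_j}^{(j)}\}$, the normalized mutual information between the
two results is defined as follows:

$$NMI(\pi_i, \pi_j) = \frac{2 I_1(\pi_i, \pi_j)}{I_2(\pi_i) + I_2(\pi_j)} \quad (1)$$

$$I_1(\pi_i, \pi_j) = \sum_{h=1}^{K_i} \sum_{l=1}^{K_j} \frac{|C_h^{(i)} \cap C_l^{(j)}|}{n} \log \frac{n\,|C_h^{(i)} \cap C_l^{(j)}|}{|C_h^{(i)}|\,|C_l^{(j)}|} \quad (2)$$

$$I_2(\pi_i) = -\sum_{h=1}^{K_i} \frac{|C_h^{(i)}|}{n} \log \frac{|C_h^{(i)}|}{n} \quad (3)$$

$$I_2(\pi_j) = -\sum_{l=1}^{K_j} \frac{|C_l^{(j)}|}{n} \log \frac{|C_l^{(j)}|}{n} \quad (4)$$

For a single BP the average mutual information can be defined as:

$$H(\pi_i) = \frac{1}{r-1} \sum_{j=1, j \ne i}^{r} NMI(\pi_i, \pi_j), \quad (i = 1, 2, \ldots, r) \quad (5)$$

where $h \in \{1, \ldots, K_i\}$ and $l \in \{1, \ldots, K_j\}$ index the cluster labels of $\pi_i$ and $\pi_j$;
$|C_h^{(i)}|$ and $|C_l^{(j)}|$ are the numbers of objects belonging to cluster $C_h^{(i)}$ in $\pi_i$
and to cluster $C_l^{(j)}$ in $\pi_j$, respectively; $|C_h^{(i)} \cap C_l^{(j)}|$ is the number of objects belonging
to both $C_h^{(i)}$ and $C_l^{(j)}$; $n$ is the number of objects in the data set; and $r$ is the
number of BPs.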
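The quantities in this section can be computed directly from cluster label lists. The sketch below is an illustration under stated assumptions: it uses the common definition of NMI as twice the mutual information divided by the sum of the two entropies, averages each BP's NMI over the other $r - 1$ BPs, and represents a BP as a list of cluster labels, one per object. The function names are hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_i, labels_j):
    """NMI of two partitions: 2 * mutual information / (entropy_i + entropy_j)."""
    n = len(labels_i)
    if n == 0 or n != len(labels_j):
        raise ValueError("partitions must be non-empty and of equal length")
    count_i = Counter(labels_i)
    count_j = Counter(labels_j)
    joint = Counter(zip(labels_i, labels_j))
    mutual = sum((c / n) * log(n * c / (count_i[h] * count_j[l]))
                 for (h, l), c in joint.items())
    entropy_i = -sum((c / n) * log(c / n) for c in count_i.values())
    entropy_j = -sum((c / n) * log(c / n) for c in count_j.values())
    denominator = entropy_i + entropy_j
    if denominator == 0:
        # Both partitions consist of a single cluster and are identical.
        return 1.0
    return 2 * mutual / denominator


def average_nmi(partitions):
    """Average NMI of each BP with all the other BPs (needs r >= 2)."""
    r = len(partitions)
    if r < 2:
        raise ValueError("need at least two partitions")
    return [
        sum(nmi(partitions[i], partitions[j]) for j in range(r) if j != i) / (r - 1)
        for i in range(r)
    ]
```

A BP that agrees with many of the others gets a high average, while a BP that is independent of the rest gets an average near 0, which is the quantity the weight design builds on.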


A greater $H(\pi_i)$ indicates that cluster member $\pi_i$ shares more information with the other
cluster members. The weight is defined as:





ae (6) 
The normalized form is defined as:

$$\bar{w}_i = \frac{w_i}{\sum_{m=1}^{r} w_m} \quad (7)$$

The greater the diversity between a BP and the other base clusterings, the larger its weight.


3 The Weight-Improved K-Means-Based Consensus Clustering 


In this section, we first introduce the co-occurrence matrix, which records how
objects are shared between the clusters of two BPs. Table 1 shows an example.

Table 1. The co-occurrence matrix

In Table 1, the BPs $\pi^*$ and $\pi_i$ contain $K$ and $K_i$ clusters, respectively, and $n_{kj}^{(i)}$ is the
number of objects that belong to both $C_k$ and $C_j^{(i)}$. Let
$n_{k+} = \sum_{j=1}^{K_i} n_{kj}^{(i)}$ and $n_{+j}^{(i)} = \sum_{k=1}^{K} n_{kj}^{(i)}$, $1 \le j \le K_i$, $1 \le k \le K$, and let
$p_{kj}^{(i)} = n_{kj}^{(i)}/n$, $p_{k+} = n_{k+}/n$ and $p_{+j}^{(i)} = n_{+j}^{(i)}/n$. We
thus obtain a normalized co-occurrence matrix (NCM), based on which we can compute
the centroids of the K-means clustering.

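The K-means algorithm cannot run directly on the co-occurrence matrix, so a binary data
set is introduced to represent the result of the $r$ BPs. The binary data set
$X^{(b)} = \{x_l^{(b)} \mid 1 \le l \le n\}$ is defined as follows:

$$x_l^{(b)} = \langle x_{l,1}^{(b)}, \ldots, x_{l,i}^{(b)}, \ldots, x_{l,r}^{(b)} \rangle, \text{ with} \quad (8)$$

$$x_{l,i}^{(b)} = \langle x_{l,i1}^{(b)}, \ldots, x_{l,ij}^{(b)}, \ldots, x_{l,iK_i}^{(b)} \rangle, \text{ and} \quad (9)$$

$$x_{l,ij}^{(b)} = \begin{cases} 1, & \text{if object } l \text{ belongs to the cluster } C_j \text{ in } \pi_i \\ 0, & \text{otherwise} \end{cases} \quad (10)$$

where $X^{(b)}$ is an $n \times \sum_{i=1}^{r} K_i$ binary data set matrix with $|x_{l,i}^{(b)}| = 1$.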
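The construction of the binary data set amounts to one-hot encoding each BP and concatenating the blocks. The following is a minimal sketch under the assumption that each BP is given as a list of cluster labels, one per object; the function name and the choice to order columns by sorted label are hypothetical.

```python
def binary_dataset(partitions):
    """Build the binary data set from r BPs.

    partitions: list of r label lists, each of length n.
    Returns n rows; row l is the concatenation of r one-hot blocks, where the
    i-th block has one column per cluster of the i-th BP and a single 1 in the
    column of the cluster that object l belongs to.
    """
    n = len(partitions[0])
    rows = [[] for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same n objects")
        clusters = sorted(set(labels))
        column = {c: idx for idx, c in enumerate(clusters)}
        for l, label in enumerate(labels):
            block = [0] * len(clusters)
            block[column[label]] = 1
            rows[l].extend(block)
    return rows
```

For two BPs with 2 and 3 clusters over three objects, each row has 5 columns and exactly two ones, one per BP.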
We use the K-means algorithm to integrate the BPs. Suppose $r$ BPs are integrated into
a result $\pi^*$, and let $m_k$ represent the centroid of the cluster $C_k$ in $\pi^*$ as follows:

$$m_k = \langle m_{k,1}, \ldots, m_{k,i}, \ldots, m_{k,r} \rangle, \text{ with} \quad (11)$$

$$m_{k,i} = \langle m_{k,i1}, \ldots, m_{k,ij}, \ldots, m_{k,iK_i} \rangle \quad (12)$$

The centroids of the K-means on $X^{(b)}$ are represented as follows:

$$m_{k,i} = \left\langle \frac{p_{k1}^{(i)}}{p_{k+}}, \ldots, \frac{p_{kj}^{(i)}}{p_{k+}}, \ldots, \frac{p_{kK_i}^{(i)}}{p_{k+}} \right\rangle \quad (13)$$

The centroids can be computed from the co-occurrence matrix, and $m_k$ is a vector of
$\sum_{i=1}^{r} K_i$ dimensions. Each element of the vector is computed from the number of objects
shared between the current cluster and each of the clusters of the BPs.

By using the co-occurrence matrix and the binary data set, the consensus clustering
is transformed into K-means clustering, that is:

$$\max_{\pi} \sum_{i=1}^{r} w_i U(\pi, \pi_i) \Leftrightarrow \min \sum_{k=1}^{K} \sum_{x_l \in C_k} f(x_l^{(b)}, m_k) \quad (14)$$

As shown in Fig. 1. In BPs generation phase, classic partition clustering method K- 
means is used, different initial number of cluster k, to generate diversified BPs. In 
consensus clustering phase, after generating the BPs and computing the weight for each 
clustering member, we can obtain the weighted co-occurrence matrix, and then we can 





Algorithm WIKCC 


Require: 
Input: a data set of known class label 
Ensure: 
Using K-means to generate r BPs called | | 
Construct the binary data set X® from |] 
The Co- occurrence matrix is constructed by using BPs 
Compute the weight of r BPs as {Wy ,W3,...,.W,} by using NMI 


Reconstruct the binary data set X using weight 


Run K-means to cluster X into K clusters and get 7* 
Return 7° 





Fig. 1. Algorithm WIKCC 
get the weighted binary data set $X^{(b,w)}$. By running K-means on the weighted binary data set $X^{(b,w)}$, we can get the final consensus result $\pi^*$.
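
The weighting step of Fig. 1 relies on the NMI between partitionings. The sketch below only computes the NMI of two label vectors; how the NMI values are turned into the weights is not restated here, so that part is left out. The normalization by the geometric mean of the two entropies and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural logarithm) of a partitioning."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information between two partitionings of the same objects."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mutual_information = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denominator = math.sqrt(entropy(a) * entropy(b))
    # A partitioning with a single cluster has zero entropy; return 0 in that case.
    return mutual_information / denominator if denominator > 0 else 0.0
```

Two partitionings that differ only in the naming of their clusters obtain an NMI of 1, and two independent partitionings obtain an NMI of 0.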


4 Experimental Results 


We present an experiment on the UCI data set iris. The normalized Rand index ($R_n$) [6] is adopted. Its value usually ranges within [0, 1]; a higher value indicates a higher clustering quality. We demonstrate the cluster validity of WIKCC by comparing it with two well-known consensus clustering algorithms: the K-means-based algorithm (KCC) [2] and the hierarchical algorithm (HCC) [5].
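
As an illustration of the evaluation measure, the sketch below computes a chance-corrected Rand index from two label vectors. It assumes that $R_n$ coincides with the adjusted Rand index of Hubert and Arabie; the exact definition used in [6] may differ, and the function name is hypothetical. It needs at least two objects.

```python
from collections import Counter
from math import comb


def normalized_rand_index(truth, pred):
    """Chance-corrected Rand index between a reference and a predicted partitioning."""
    n = len(truth)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitionings have a single cluster
    return (pairs_joint - expected) / (max_index - expected)
```

The chance correction is why the value "usually" lies in [0, 1]: a partitioning that agrees with the reference less than a random one would yields a negative value.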


4.1 Quality of BPs 


We run the K-means algorithm 100 times with the number of clusters randomized within $[K, \sqrt{n}]$ to generate 100 basic partitionings (BPs) for consensus clustering, where K is the true number of classes in the data set and n is the number of instances. The squared Euclidean distance is used as the distance function, and the quality of each BP is measured by $R_n$. The distribution of the quality of the BPs is shown in Fig. 2.
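
The choice of the cluster numbers for the 100 runs can be sketched as follows. This is a minimal sketch, assuming the number of clusters is drawn uniformly from the integers between K and the floor of $\sqrt{n}$; the sampling distribution, the function name, and the fixed seed are assumptions for illustration.

```python
import math
import random


def sample_cluster_numbers(true_k, n, runs=100, seed=0):
    """Draw one cluster number per K-means run from [K, floor(sqrt(n))]."""
    rng = random.Random(seed)
    upper = max(true_k, math.isqrt(n))  # guard against sqrt(n) < K
    return [rng.randint(true_k, upper) for _ in range(runs)]


# iris has n = 150 instances and K = 3 classes, so k ranges over 3..12.
ks = sample_cluster_numbers(true_k=3, n=150)
```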
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Fig. 2. Clustering quality distribution of BPs 
As can be seen in Fig. 2, the distribution of the clustering quality of the BPs shows that a large proportion of the BPs have poor quality, and only a small proportion have relatively high quality. This shows that an incorrect pre-specified number of classes leads to weak clustering results.


4.2 Exploration of Impact Factors 


In order to determine a suitable number of BPs for WIKCC, we explore the influence of the number of BPs on consensus clustering. In the above experiment, r = 100 BPs were generated. We randomly select a part of the BPs to obtain the subsets $\Pi^{r}$ with r = 10, 20, ..., 90. For each r we run the KCC [2] algorithm 100 times to get 100 consensus clustering results. The distribution of the quality of the consensus clustering results for the different subsets is shown in Fig. 3.
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Fig. 3. Impact of the number of BPs to the consensus clustering 


As shown in Fig. 3, when r < 50 the quality of the consensus result increases with r, but when r > 50 the result fluctuates in a small range and tends to be stable. This implies that 50 may be an appropriate number of BPs for WIKCC. Based on the above exploration, we chose the 50 BPs of highest quality for WIKCC.


4.3  WIKCC versus Other Clustering Methods 


We compare WIKCC with KCC and HCC. For each method we choose the top 50 BPs and run it on the iris data set 10 times.
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Fig. 4. WIKCC versus KCC and HCC 


As can be seen in Fig. 4, WIKCC scores significantly higher than KCC and also outperforms HCC in terms of the quality of consensus clustering. In addition, comparing Figs. 2 and 4, we can see that consensus clustering is much better than almost all of the basic clustering results obtained by K-means. This indicates that, by integrating the commonality of many basic clustering results, the consensus clustering method can find the real cluster structure more accurately than a single traditional clustering algorithm, so it can obtain a more stable and accurate clustering result by ensembling multiple weak BPs.


5 Concluding Remarks 


We explore the influence of the number of BPs on consensus clustering and choose an appropriate number of better BPs for WIKCC. The weight is designed by the NMI method between two BPs based on the co-occurrence matrix. Finally, the experiment on iris demonstrates that WIKCC outperforms the well-known state-of-the-art KCC and HCC algorithms in terms of clustering quality. In the future, we will explore other factors that influence the performance of KCC, and we will consider them when designing the weights.
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Abstract. Whenever traditional process discovery techniques are confronted 
with complex and flexible environments, equipping all the traces with just one 
single model might lead to a spaghetti-like process description. Trace clustering, which splits the logs into clusters and applies a discovery algorithm per cluster, has proved to be a versatile solution to this problem. Nevertheless, most trace clustering techniques are not precise enough due to the indiscriminate treatment of the activities captured in traces. As a result, the impacts of some important activities are reduced and some typical information may be distorted or even lost during comparison. In this paper, we propose a novel trace clustering technique that is based on constrained trace alignment and then adapt two appropriate clustering strategies to the process mining perspective. Experiments on real-life event logs show that our technique clearly outperforms existing ones in terms of process model complexity and comprehensibility.


Keywords: Constrained Trace Clustering · Trace clustering · Process mining · Business process management · Constrained Trace Alignment


1 Introduction 


Process discovery is one of the most crucial process mining tasks that entails the 
construction of process models from event logs of information systems [1]. The most 
arduous challenge for process discovery is tackling the problem that discovery algo- 
rithms are unable to generate accurate and comprehensible process models out of event 
logs stemming from highly flexible environments. 

Trace Clustering is an efficient solution, which clusters the traces such that each 
of the resulting clusters corresponds to coherent sets of cases that can each be 
adequately represented by a process model [3]. Figure 1 shows the basic procedure for 
trace clustering. 
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Fig. 1. Illustration of the basic procedure of trace clustering in process mining. 


Nevertheless, most currently available trace clustering techniques are not precise enough due to their indiscriminate treatment of the activities captured in traces. As a result, the impacts of some important activities are reduced and some typical process information may be distorted or even lost during comparison.

To address this drawback, this paper presents a novel similarity measurement based on constrained trace alignment. First, some typical causal sequences that reflect the “backbone” of the process are identified. Then, these sequences are exploited as constraints to guarantee the priority of important activities in traces. Subsequently, we suggest two clustering strategies that agree with the process mining perspective. Agglomerative hierarchical clustering (AHC) was selected because its embedded flexibility in abstraction level provides an overall insight into the complex process, and spectral clustering gives a good recommendation about the number of clusters corresponding to the generic abstraction level.

In brief, this work contributes by proposing a novel constrained trace similarity 
measurement to guarantee the priority of important process episodes and subsequently 
adapting two appropriate clustering techniques into process mining perspective. In 
addition, experiments on real-life logs prove the improvements achieved by our method 
relative to six existing methods. 

The rest of the paper is organized as follows: Sect. 2 provides a brief overview of 
related works. Next, Sect. 3 introduces our novel constrained trace similarity measure- 
ment and the process-adaptive clustering strategies we selected. And Sect. 4 discusses 
the experiment results. Finally, Sect. 5 draws conclusions and spells out directions for 
future work. 
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2 Related Work 


The main distinction between trace clustering techniques is the clustering bias (distance/ 
similarity measures) they proposed. Existing approaches in literature can be classified 
into two major categories. 


2.1 Distance-Based Trace Clustering 


2.1.1 Vector-Based Trace Clustering 

Vector-based trace clustering approaches transform traces into a vector space. Then, 
clustering can be achieved combining different distance metrics in the vector space. 
Greco et al. [8] were pioneers in the study of clustering log traces within the process mining domain. They cluster traces in a vector space over the activities and their transitions, in a first attempt to discover expressive process models. They also introduced a notion of disjunctive workflow schemas (DWS) for divisive trace clustering [5].
Song et al. [13] elaborated on constructing so-called profiles associated with multiple 
trace perspectives as the feature vector. 


2.1.2 Context-Aware Trace Clustering 

Context-aware trace clustering approaches regard the entire trace as a whole sequence 
which implies all the process context information. Then various string edit distance 
metrics can be applied to it in conjunction with standard clustering techniques. In [2], Bose and van der Aalst propose a generic edit distance that derives specific edit operation costs so as to take into account the behavior in traces. The context-aware method is further developed in [3], where conserved patterns or subsequences are leveraged as feature sets to describe the characteristics of a trace.


2.2 Model-Based Trace Clustering 


2.2.1 Sequence Clustering 

Sequence clustering algorithm creates first-order Markov chains for clusters cooperating 
with the expectation-maximization (EM) algorithm to determine the assignment of a 
certain sequence. It has been used to automatically group large protein datasets to search 
for homologous gene sequences in bioinformatics. This technique was migrated into 
trace clustering by [6]. 


2.2.2 Active Trace Clustering 

Active trace clustering inherits the underlying idea of sequence clustering [15]. Therein 
a trace is added to the current cluster if the model discovered from the cluster including 
that trace satisfies the target threshold of fitness. An optimal distribution of execution 
traces over a given number of clusters is achieved whereby the combined accuracy of 
the associated process models is maximized. In this way, the quality of the discovered process models is kept under control. Further extensions that support additional objectives have been developed in [7].


56 P. Wang et al. 


3 Approach Design 


Distance/similarity measurement and clustering strategy, each with its specific characteristics, are both important cluster-theoretical aspects. Therefore, we introduce our approach in two steps. The framework of the approach in this paper is depicted in Fig. 2.
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Fig. 2. Framework of Constrained Trace Clustering. 


3.1 Similarity Measurement Based on Constrained Traces Alignment 


Just as noted by [10], “specifying an appropriate dissimilarity measure is far more 
important than choice of clustering algorithm. This aspect of the problem is emphasized 
less in the clustering literature since it depends on domain knowledge specifics.” Therefore, it is essential to refine the distance/similarity metrics in trace clustering so as to provide more appropriate information to the clustering algorithm.

We can perceive that identifying some significant behaviors in traces will assist in mining better sub-process models by clustering the traces based on those significant behaviors. However, due to the indiscriminate treatment of behaviors in traces and the lack of domain knowledge, capturing them directly from event logs is a difficult task.

Fortunately, association rules from data mining shed light on this task. Employing association rules, we are able to reveal the “backbone” of the process. Then, some typical causal sequences that reflect the process backbone are identified and exploited as constraints to guarantee the priority of significant behaviors during similarity comparison.


3.1.1 Dependency Measures to Reveal Process Backbone 
Definition 1. (Dependency Measures) Let L be an event log over A, and let a and b be activities that occur in L; $|\sigma|$ denotes the length of the trace $\sigma$.

$|a >_L b|$ is the number of times a is directly followed by b in L, i.e.,

$$|a >_L b| = \sum_{\sigma \in L} L(\sigma) \times \left|\{1 \le i < |\sigma| \mid \sigma(i) = a \wedge \sigma(i+1) = b\}\right| \tag{1}$$

$|a \Rightarrow_L b|$ is the value of the dependency relation between a and b:

$$|a \Rightarrow_L b| = \begin{cases} \dfrac{|a >_L b| - |b >_L a|}{|a >_L b| + |b >_L a| + 1}, & \text{if } a \neq b \\[2ex] \dfrac{|a >_L a|}{|a >_L a| + 1}, & \text{if } a = b \end{cases} \tag{2}$$

$|a \Rightarrow_L b|$ produces a value between −1 and 1. If $|a \Rightarrow_L b|$ is close to 1, then there is a strong positive dependency between a and b.

By setting μ, the threshold of $|a >_L b|$, we can filter out the infrequent items. When $|a \Rightarrow_L b|$ meets a certain threshold ν, we specify that there is a connection between a and b.
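
The two measures of Definition 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the log is represented as a dictionary from traces to multiplicities, a threshold is taken to be met when the value is greater than or equal to it, and the small log and the function names are hypothetical rather than taken from the paper.

```python
from collections import Counter


def directly_follows(log):
    """Count |a >_L b| over a log given as {trace tuple: multiplicity}, Eq. (1)."""
    counts = Counter()
    for trace, multiplicity in log.items():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += multiplicity
    return counts


def dependency(counts, a, b):
    """Dependency value |a =>_L b| as in Eq. (2)."""
    if a == b:
        return counts[(a, a)] / (counts[(a, a)] + 1)
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


def dependency_graph(log, min_count, min_dependency):
    """Keep the arcs whose frequency and dependency value reach both thresholds."""
    counts = directly_follows(log)
    return {
        (a, b)
        for (a, b), c in counts.items()
        if c >= min_count and dependency(counts, a, b) >= min_dependency
    }


# Hypothetical log: the traces and multiplicities are made up for illustration.
log = {("a", "b", "c"): 3, ("a", "c", "b"): 1, ("a", "a", "b", "c"): 2}
counts = directly_follows(log)
edges = dependency_graph(log, min_count=2, min_dependency=0.5)
```

With these made-up multiplicities, a is directly followed by b five times, the dependency value from b to c is (5 − 1)/(5 + 1 + 1) = 4/7, and the arcs kept are (a, b), (b, c) and the self-loop (a, a).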
To illustrate the basic concepts, we use the following event log L: 


L=[|< a,d,e,f,g,i >!,< a,b,e,f,i>',<a,c,e,f,h,i>',<a,c,b,e,f,hi>', 
<a,e,f,g,i>,<d,d,a,d,e,f,g,i>',< a,b,c,e,f,h,i>,< a,d,d,d,e,f,¢,i>'] 


Figure 3(A) depicts the dependency graph corresponding to the thresholds μ = 2, ν = 0.7, and Fig. 3(B) is another dependency graph with the thresholds μ = 5, ν = 0.85. Obviously, the dependency graph does not show the routing logic, but it reveals the “backbone” of the process model. The derived dependency graph (denoted as G) is then used as a reference to reveal the general order of some typical activities.





Fig. 3. Dependency graphs according to the dependency measures. 
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3.1.2 Constrained Similarity Measurement 

Let $A_i$ represent the set of activities involved in trace $\sigma_i$ (known as its alphabet), while B represents the set of activities involved in the “process backbone”. $A_i \cap A_j \cap B$ are the common activities between $\sigma_i$ and $\sigma_j$ with respect to the “process backbone”. Referring to the dependency graph mentioned above, we can get their causal sequences. For example, the common activities between the traces $\langle d,d,a,d,e,f,g,i \rangle$ and $\langle a,d,d,d,e,f,g,i \rangle$ in L and the dependency graph are $\langle a,d,e,f,g,i \rangle$; their constraints are shown as the black highlights in Fig. 4. We call these common activities, together with their causal sequences, typical behaviors.





Fig. 4. Illustration of constraint instance. 


These behaviors provide an approximation of the essence of the two compared traces from a global perspective. Next, we utilize them as constraint conditions to guarantee the priority of typical behaviors between traces. In the following, the typical behaviors between traces $\sigma_i$ and $\sigma_j$ concerning the process backbone are denoted as $C_{i,j}$.

Definition 2. (Constrained Similarity Measurement) Let $\sigma_i$ and $\sigma_j$ be two traces with constraints $C_{i,j}$. Then, the similarity of $\sigma_i$ and $\sigma_j$, $Sim(\sigma_i, \sigma_j)$, is defined as:

$$Sim(\sigma_i, \sigma_j) = \frac{length(CLCS(\sigma_i, \sigma_j, C_{i,j}))}{\max(length(\sigma_i), length(\sigma_j))} \tag{3}$$


The constrained measurement relies on the constrained longest common subsequence (CLCS). The CLCS has already been applied in bioinformatics for the computation of the homology of two biological sequences. For more details about CLCS, please refer to [14].
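
Equation (3) can be illustrated with the textbook dynamic program for the constrained longest common subsequence. This is a sketch under assumptions: the constraint $C_{i,j}$ is treated as a single activity sequence, traces are strings with one character per activity, a pair of traces that cannot satisfy the constraint gets similarity 0, and the function names are hypothetical; the algorithm of [14] may differ in detail.

```python
def clcs_length(x, y, constraint):
    """Length of the longest common subsequence of x and y that contains
    `constraint` as a subsequence, or -1 if no such subsequence exists."""
    neg = float("-inf")
    n, m, r = len(x), len(y), len(constraint)
    # table[k][i][j]: best length using x[:i] and y[:j] while covering constraint[:k]
    table = [[[neg] * (m + 1) for _ in range(n + 1)] for _ in range(r + 1)]
    for k in range(r + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i == 0 or j == 0:
                    table[k][i][j] = 0 if k == 0 else neg
                elif x[i - 1] == y[j - 1]:
                    if k > 0 and x[i - 1] == constraint[k - 1]:
                        # the matched activity also consumes one constraint symbol
                        table[k][i][j] = table[k - 1][i - 1][j - 1] + 1
                    else:
                        table[k][i][j] = table[k][i - 1][j - 1] + 1
                else:
                    table[k][i][j] = max(table[k][i - 1][j], table[k][i][j - 1])
    best = table[r][n][m]
    return -1 if best == neg else int(best)


def constrained_similarity(x, y, constraint):
    """Eq. (3); both traces are assumed to be non-empty."""
    return max(clcs_length(x, y, constraint), 0) / max(len(x), len(y))


# The two example traces and their common backbone activities.
sim = constrained_similarity("ddadefgi", "adddefgi", "adefgi")
```

For the two example traces, the constrained subsequence is a, d, e, f, g, i of length 6, so the similarity is 6/8 = 0.75, whereas the unconstrained longest common subsequence d, d, d, e, f, g, i has length 7 and skips activity a.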


3.2 Adapted Trace Clustering Strategies 


Traditional trace clustering only adapts data-centric clustering algorithms. However, as 
described in [3], the most important evaluation dimension for trace clustering is from a 
process discovery perspective. This entails the compatibility with process features on 
the adopted clustering strategies. Here, we suggest two apposite clustering strategies for 
different applications. 
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3.2.1 Agglomerative Hierarchical Clustering 

Thanks to its hierarchical nature, the AHC algorithm agrees well with the continuum ranging from unstructured processes to structured processes. The bottom of the AHC hierarchy consists of one cluster per trace, whose mined processes are surely structured, while the top is the single cluster that contains every trace, whose mined process is usually unstructured when confronted with a flexible environment. Thus, we are able to select the applicable level as desired, or traverse the whole hierarchy to gain an overall insight into the complex process.

The method we adopt, termed GuideTreeMiner, was proposed in [4]. The GuideTreeMiner uses the AHC algorithm to build a guide tree (also known as a dendrogram). Any horizontal line spanning the dendrogram corresponds to a practical clustering at a specific abstraction level.


3.2.2 Spectral Clustering 

Except for hierarchical clustering algorithms, most trace clustering techniques require predefined parameters such as the number of clusters or the maximum cluster size. Defining these specialized parameters is far from easy for general users due to the lack of domain knowledge. Even for experts it is not a trivial task, owing to the complexity and flexibility of real-life processes. Against this background, spectral clustering was selected, as it provides a good recommendation about the number of clusters [11], which can guide us to a generic abstraction level.

It is worth pointing out that the affinity matrix is usually calculated with a Gaussian kernel, which does not reflect the nature of processes. In this work, it is therefore calculated as the constrained similarity described in the previous section. Likewise, the Laplacian is often normalized, but due to the robustness of the constrained similarity measurement against infrequency and the pursuit of a stable clustering indication, we select the non-normalized Laplacian. The cluster number k can be recognized from the sudden drop in the eigenvalues. In many cases, the two solutions can be used in combination.
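
One simple way to read off the sudden drop is to look for the largest gap between consecutive eigenvalues. The sketch below assumes the eigenvalues have already been computed, sorts them in descending order, and returns the position of the largest drop; the function name and the sample values are hypothetical, and other readings of the drop are possible.

```python
def suggest_cluster_count(eigenvalues, max_k=None):
    """Return k such that the largest drop lies between the k-th and (k+1)-th value."""
    values = sorted(eigenvalues, reverse=True)
    limit = len(values) - 1 if max_k is None else min(max_k, len(values) - 1)
    drops = [values[k - 1] - values[k] for k in range(1, limit + 1)]
    return drops.index(max(drops)) + 1


# Hypothetical eigenvalues with a clear drop after the third one.
k = suggest_cluster_count([9.1, 8.7, 8.2, 2.0, 1.8, 1.5])
```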


4 Experiments 


4.1 Experiment Configuration and Evaluation Criterion 


We used the ProM¹ framework, which has been developed to support process mining algorithms, to perform the experiments. The data is from a Dutch Financial Institute². We adopted the HeuristicsMiner to derive the process models, as it has the best capability to deal with real-life event logs. The approaches we compare with are: DWS Mining [5], Trace Clustering [13], GuideTree Miner [2, 3], Sequence Clustering [6] and ActiTraC [15].

¹ http://www.processmining.org.
² http://www.win.tue.nl/bpi/doku.php?id=2012:challenge.
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We evaluate the results with respect to their model complexity, as they are measured 
by a comprehensive list of metrics reported in [12]: 


|A| signifies the number of arcs in the process model.
|N| signifies the number of nodes in the process model.
$CN = |A| - |N| + 1$ signifies the cyclomatic number of the process model.
$CNC = \frac{|A|}{|N|}$ signifies the coefficient of connectivity of the process model.
$\Delta = \frac{|A|}{|N| \cdot (|N| - 1)}$ signifies the density of the process model.
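
The three derived metrics follow directly from the numbers of arcs and nodes. The short sketch below computes them for the unclustered model reported in Table 1 (141 arcs, 36 nodes); the function name and the dictionary keys are assumptions made for illustration.

```python
def complexity_metrics(num_arcs, num_nodes):
    """Cyclomatic number, coefficient of connectivity and density of a process model."""
    return {
        "CN": num_arcs - num_nodes + 1,
        "CNC": num_arcs / num_nodes,
        "density": num_arcs / (num_nodes * (num_nodes - 1)),
    }


metrics = complexity_metrics(141, 36)
```

Rounded to three decimals, this gives CN = 106, CNC = 3.917 and a density of 0.112, in line with the unclustered row of Table 1.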


4.2 Clustering Results 


We made comparisons with different numbers of clusters for three different abstraction levels. Here, level 1 represents the original trace set, i.e. there is only one cluster. Level 2 stands for 2/3/4 clusters, while level 3 contains 5/6/7 clusters. We calculated |A| by taking the node-weighted average over the corresponding clusters, and |N| in the same way.

The aggregated results are presented in Table 1, and all the data are depicted in the radar plots in Fig. 5. We can see that all clustering techniques lead to models with lower complexity than the original log file. However, the DWS, ActiTraC and Trace Clustering approaches lead to clusters whose models have higher density values than the unclustered one, though they perform well in the other metrics. The smaller area and more balanced capabilities shown in the radar plots for the two abstraction levels demonstrate the effectiveness of our constraints.


Table 1. The aggregated clustering results

Level   | Approach    | Clusters | Arcs  | Nodes | CNC   | CN    | Density
Level 1 | Unclustered | 1        | 141.0 | 36.0  | 3.917 | 106.0 | 0.112
Level 2 | …           | …        | …     | …     | …     | 65.6  | 0.082
        | DWS         | …        | …     | …     | …     | 39.8  | 0.129
        | TC          | …        | 50.0  | 16.7  | 2.994 | 33.3  | 0.191
        | ATC         | …        | 21.5  | 12.0  | 1.792 | …     | 0.163
        | …           | …        | …     | …     | …     | …     | …
        | MRA         | 3        | 94.4  | 34.6  | 2.728 | 59.8  | 0.081
        | CoTC        | 3        | 86.5  | 35.3  | 2.450 | 51.2  | 0.071
Level 3 | …           | …        | …     | …     | …     | 39.8  | 0.064
        | DWS         | …        | 41.7  | 19.0  | 2.195 | 22.7  | 0.122
        | ATC         | …        | 15.6  | 10.1  | 1.544 | 5.5   | 0.170
        | …           | …        | …     | …     | …     | …     | …
        | MRA         | 6        | 82.3  | 33.9  | 2.428 | 48.4  | 0.074
        | CoTC        | 6        | 73.5  | 33.1  | 2.221 | 39.4  | 0.069
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Fig. 5. Radarplots of different trace clustering techniques. 


Moreover, Fig. 6 depicts Constrained Trace Clustering at different levels. With an increasing number of clusters, there is only a little improvement in all aspects. Considering the extra effort that more clusters require, it is inefficient and of little value to set the number of clusters to a higher value. This is consistent with the spectral clustering: as the eigenvalue scatterplot in Fig. 7 shows, only the first three clusters are well separated. Therefore, spectral clustering can guide us to the right level of abstraction by providing a good recommendation about the number of clusters, instead of iterating over different hierarchies.
Fig. 6. Constrained Trace Clustering from different abstraction levels.

Fig. 7. Similarity matrix eigenvalues.


In a nutshell, all these experiments confirm the effectiveness and efficiency of 
Constrained Trace Clustering to deliver comprehensible process models from the flex- 
ible and complex logs. 
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5 Conclusions and Future Perspectives 


In this paper, we contribute to trace clustering techniques by imposing constraints on 
trace similarity/distance measurements to guarantee the priority of important activities 
in traces. By this means, significant process information is preserved as much as possible 
such that more accurate trace similarity measurement will be obtained. Moreover, we 
integrate two clustering strategies that agree with the process mining perspective to 
cluster these traces. 

A number of challenging issues remain open for future research. Firstly, more refined similarity/distance measurements that agree well with process domain knowledge are encouraged. On the issue of processing sequence data, we should learn from bioinformatics, which has more mature applications of sequence data mining. Another direction of research is integrating advanced clustering strategies, as currently available techniques only bring traditional data clustering techniques, which are data-centric instead of process-centric, into process mining. More generally, designing ad hoc clustering strategies usually leads to more suitable clustering results. Finally, trace clustering techniques only take into account sorting in the horizontal direction; it would be promising to combine them with techniques that focus on abstraction in the vertical direction, such as FuzzyMiner [9], which brings cartography to process mining by means of clustering activities.
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Abstract. Representing local image patches is a key step in many appli- 
cations of computer vision, while fast and effective description methods 
are always required by real-time image processing. Motivated by the 
fact that quantization compresses information while preserving primary 
structures, in this paper, we propose to use vector quantization (VQ) on 
local patch descriptor building. Compared to conventional approaches that compress floating-point features with VQ, we produce local integer descriptors very fast, directly based on simple quantization methods. Experimental results on a publicly available dataset show that the present method is efficient both to build and to match. It achieves performance comparable to some typical floating-point and binary descriptors such as SIFT and BRIEF, offering a novel solution to fast local image representation other than the bit test introduced in BRIEF.





Keywords: Local image descriptor · Vector quantization · Feature matching · Cumulative distribution function


1 Introduction 


As one of the vital problems in image processing and understanding, image representation is a core research topic in many computer vision applications, such as image registration, image retrieval, and object classification. This task can generally be divided into two main steps: the extraction of feature points and the description of these keypoints [9]. During the past decades, the Scale Invariant Feature Transform (SIFT) [11] proposed by Lowe was a benchmark for its excellent performance, and Speeded Up Robust Features (SURF) [2], which exhibits much higher efficiency than SIFT, has since become a de facto standard.

SURF addresses the issue of speed, but since the SURF feature is still a vector of floating-point values, the storage and computation of features are still expensive. Thus, a number of techniques have been proposed from different aspects to reduce memory consumption and speed up the matching of features. The first direct way is to work with short descriptors, which are obtained by applying dimensionality reduction to computed features. Principal Component Analysis (PCA) or Linear Discriminant Embedding (LDE) [8] can be easily used to reduce descriptor size without loss in recognition performance [13]. Another way to shorten a descriptor is to quantize its floating-point coordinates into integers [20] or binarize the feature with a few bits. Quantization is a simple additional operation yielding better memory gain and faster matching, and both CHoG [5] and PQ-WGLOH [21] use vector quantization to build compact descriptors. Binarization motivated by Locality Sensitive Hashing (LSH) [6] is usually achieved by using hash functions to turn floating-point vectors into binary strings. Such implementations, especially on SIFT and GIST [15], can be easily found in the literature on improving existing features [14,19].

Though different approaches to dimensionality reduction have been added, the construction of the original full descriptor is still indispensable. In other words, the substantial amount of time-consuming computation has not been reduced yet. With the development of mobile devices, algorithms with low computational complexity are in great demand. Binary Robust Independent Elementary Features (BRIEF), proposed by Calonder et al. [4], constructs the descriptor by directly comparing pairwise pixel intensities to produce a binary string for an image patch. Other outstanding features based on bit tests include Oriented FAST and Rotated BRIEF (ORB) [16], Binary Robust Invariant Scalable Keypoints (BRISK) [10], Fast Retina Keypoint (FREAK) [1], etc.

Different from BRIEF-like descriptors, Locally Uniform Comparison Image 
Descriptor (LUCID) [22] attempts to describe an image patch from the view of 
order permutation of pixel intensities. The employment of permutation brings in 
a better resistance to monotonic brightness changes, meanwhile the linear-time 
construction and integer representation of LUCID descriptor lead to a good time 
efficiency. Local Image Permutation Interval Descriptor (LIPID) [18] improved 
the robustness of LUCID by means of zone division. These descriptors create 
fast and short patch representation in a different way from bit test. 

As we know, many descriptors are constructed based on gradient, histogram, 
bit comparison and filtering. Quantization is also utilized as a postprocessing 
step to shorten a descriptor as we have mentioned above. But to our knowl- 
edge, no one makes a direct use of it in descriptor building. In this paper, we 
show that a local image patch can be described by virtue of patch quantization, 
which produces a descriptor that is very fast both to build and to match. We 
study two simple ways to implement the patch vector quantization and compare 
the respective performance. The experimental results show that the proposed 
method is sufficient to obtain good matching results, as long as invariance to 
large in-plane rotations is not a requirement. Moreover, the quantization-based 
method employs integer descriptors which avoid floating point computation, so 
it is promising to be time efficient. In the rest of this paper, Sect. 2 will describe 
the overall methodology, experimental results and discussions are provided in 
Sect. 3, and conclusions are drawn in Sect. 4. 
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2 Methods 


2.1 Image Patch Quantization 


Quantization methods have been thoroughly studied in the field of signal pro- 
cessing, including the coding and compression of different signals. Image quan- 
tization represents the image with finite gray levels, whose idea is exactly the 
same as we desire for image descriptors — represent the most primary informa- 
tion with less data. Inspired by this, we attempt to use vector quantization on 
image patches for descriptor construction. Quantizing an image patch refers to 
the quantization of the gray scales. A scalar quantization can be understood as a mapping of an input onto a finite number of outputs. A straightforward way is uniform quantization, which divides the gray scale equally. It is the simplest and fastest way to implement, and the intensity mapping function for pixels within a patch is as follows:

$$F(p) = \left\lfloor \frac{(p(x, y) - minv) \cdot N_q}{maxv - minv + 1} \right\rfloor \tag{1}$$

where

$$minv = \min_{(x,y) \in P} p(x, y), \quad maxv = \max_{(x,y) \in P} p(x, y) \tag{2}$$

and p(x, y) is the pixel intensity in a patch P, $N_q$ is the number of quantization levels, and $\lfloor \cdot \rfloor$ refers to rounding down.
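
Equations (1) and (2) translate almost literally into code. The sketch below is a minimal illustration that assumes the patch is a list of rows of integer intensities; the function name and the toy patch are hypothetical.

```python
def uniform_quantize(patch, levels):
    """Map every pixel of a patch to one of `levels` equally wide gray intervals."""
    pixels = [p for row in patch for p in row]
    minv, maxv = min(pixels), max(pixels)  # Eq. (2)
    span = maxv - minv + 1
    # Integer floor division implements the rounding down of Eq. (1).
    return [[(p - minv) * levels // span for p in row] for row in patch]


quantized = uniform_quantize([[10, 20], [30, 49]], levels=4)
```

Because of the "+ 1" in the denominator, the brightest pixel is mapped to the last level rather than to a level of its own, so the outputs stay between 0 and one less than the number of levels.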

Though the above method is a fast and reasonable solution, it may be too simple and crude, as it does not take the intensity distribution into account. Another way to quantize the patch lays emphasis on equal pixel quantities. This non-uniform quantization divides the gray scale into disjoint intervals of equal probability by virtue of the cumulative distribution function (CDF) [3]. The original histogram of an image patch counts the number of pixels whose intensities fall into each gray level, which can be written as:

$$N(i) = N(p(x, y) = i) \tag{3}$$

and its CDF is calculated as:

$$CDF(i) = \sum_{j=minv}^{i} N(j) \tag{4}$$

where minv is defined in Eq. (2). The CDF counts the number of pixels whose intensities are at or below each gray level, which provides us with an expression for quantity partition. For a CDF, it is obvious that $CDF(maxv) = N_{patch}$ (the number of pixels in the whole patch) and $CDF(minv) = N(minv)$ (the number of pixels with the smallest intensity). So the intensity mapping function can be written in two forms:

$$F(i) = \left\lfloor \frac{(CDF(i) - CDF(minv)) \cdot N_q}{CDF(maxv) - CDF(minv) + 1} \right\rfloor \tag{5}$$

$$\phantom{F(i)} = \left\lfloor \frac{(CDF(i) - N(minv)) \cdot N_q}{N_{patch} - N(minv) + 1} \right\rfloor \tag{6}$$
Figure 1 illustrates how these two quantization methods work by plotting histograms of an image patch together with lines that indicate the dividing gray levels. Figure 1a shows that the first method sets uniform threshold levels within the range of gray levels. Figure 1b and c describe the process of the second approach. Figure 1b plots the CDF we use to get threshold levels that ensure an equivalent number of pixels in each interval. The green horizontal lines in Fig. 1b indicate the uniform partition of pixel quantities, and the intersection points of these lines with the CDF show the locations of the thresholds (red lines). Figure 1c shows these non-uniform thresholds, obtained in Fig. 1b, on the gray scale. The number of pixels within each interval is approximately the same.
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Fig. 1. Illustration of quantization approaches on image histograms. (a) Quantization 
based on equal division on gray levels. (b) Cumulative distribution function and equal 
division on pixel numbers. (c) Quantization based on equal pixel quantities. (Color 
figure online) 
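
The CDF-based mapping of Eqs. (3), (4) and (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `cdf_quantize` and the representation of a patch as a list of rows are assumptions made for this example.

```python
from collections import Counter

def cdf_quantize(patch, n_levels):
    """Map every pixel of a 2-D patch to a level using the form of Eq. (6)."""
    pixels = [v for row in patch for v in row]
    hist = Counter(pixels)                 # N(i), Eq. (3)
    # CDF(i) of Eq. (4); only gray levels that occur in the patch are needed
    cdf, running = {}, 0
    for level in sorted(hist):
        running += hist[level]
        cdf[level] = running
    n_patch = len(pixels)                  # CDF(maxv)
    n_min = hist[min(pixels)]              # CDF(minv) = N(minv)
    denom = n_patch - n_min + 1
    # Integer floor division implements the rounding down in Eq. (6)
    return [[(cdf[v] - n_min) * n_levels // denom for v in row] for row in patch]

# Hypothetical 2 x 4 patch quantized to 4 levels, then vectorized row-major.
patch = [[10, 10, 20, 30],
         [40, 50, 60, 70]]
descriptor = [v for row in cdf_quantize(patch, 4) for v in row]
```

Because of the `+ 1` in the denominator, the largest value produced is `n_levels - 1`.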


2.2 Descriptor Construction 





Once the intensity mapping function is given, we can obtain a matrix by mapping
all pixels within the patch to new values. By vectorizing it, we get the feature
vector that serves as the patch descriptor. Our descriptors take the form of
integers within the range of 0 to N_q - 1, where N_q is the number of
quantization levels. Figure 2 depicts the descriptor construction process.

The main parameters of our descriptors are the number of quantization levels
and the patch size. Too few levels result in a loss of feature distinctiveness,
but too many levels defeat the purpose of feature extraction. The patch size
must also be chosen properly: oversized patches no longer improve description
effectiveness and only waste computation power. As usual, parameters are
trade-offs between precision and recall, accuracy and efficiency. We provide
some experiments on parameter selection in Sect. 3.2, from which some empirical
parameters can be derived for practical use.

Fig. 2. The process of descriptor construction.

We observe that the descriptor quantized via the CDF is actually an
order-permutation-based method, which should be invariant to monotonic
illumination changes [17]. In combinatorics, a permutation is a bijective
mapping of a finite set onto itself. It can be understood as a sequence in which
every element of the finite set appears exactly once. We use
X = (x_1, x_2, ..., x_n) to denote the vector of an image patch; the
vectorization is arbitrary (e.g., row-major) but fixed for all patches. Let
Z = (0, ..., 0, 1, ..., 1, ..., N_q - 1, ..., N_q - 1) be a vector associated
with X, where Z contains the quantized values of all elements (pixel
intensities) in X. Z is a fixed vector, and we define a bijective mapping or
action π to denote the correspondence between X and Z via the mapping function
in Eq. (6). From the CDF-based quantization process, we know that this action
depends on the order of the elements rather than on their values. Hence, the
descriptor D is a permutation of Z, obtained by applying π^{-1} to Z. We
therefore expect the descriptor based on CDF patch quantization to be robust to
monotonic changes.
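
The order-based argument above can be checked numerically. The sketch below applies the mapping of Eq. (6) to a vectorized patch and to two strictly increasing transformations of it; the helper name `cdf_descriptor` and the sample intensities are hypothetical, chosen only for illustration.

```python
from collections import Counter

def cdf_descriptor(x, n_levels):
    """Quantize a vectorized patch x with the CDF-based mapping of Eq. (6)."""
    hist = Counter(x)
    cdf, running = {}, 0
    for level in sorted(hist):
        running += hist[level]
        cdf[level] = running
    n_min = hist[min(x)]
    denom = len(x) - n_min + 1
    return [(cdf[v] - n_min) * n_levels // denom for v in x]

x = [12, 40, 40, 7, 200, 33, 90, 7]       # hypothetical patch intensities
brighter = [2 * v + 25 for v in x]        # strictly increasing affine change
squared = [v * v for v in x]              # strictly increasing for v >= 0

base = cdf_descriptor(x, 4)
unchanged = base == cdf_descriptor(brighter, 4) == cdf_descriptor(squared, 4)
```

Since a strictly increasing transformation preserves the ordering of the pixels, the histogram counts and the CDF values at each pixel are unchanged, so `unchanged` evaluates to `True`. A transformation that merges distinct gray levels (for example through rounding) is not covered by this argument.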











3 Experiments 


In this section, we compare our method with several competing descriptors on a
publicly available real-world dataset. We discuss the parameters of the
quantization-based descriptors and evaluate performance in terms of recognition
rate and computation time.


3.1 Experiment Setup 


Our methods are tested on the widely used dataset introduced by Mikolajczyk
and Schmid [13], as shown in Fig. 3. The dataset contains several image
sequences; each sequence includes six images of the same scene with varying
degrees of the same image transformation. They are designed to test robustness
to typical image disturbances that often occur in real-world scenarios. For all
sequences, we consider five image pairs by matching the first image to the
remaining five. Our quantization-based methods are denoted Q1 (uniform
quantization) and Q2 (using the CDF). All experiments are implemented with
OpenCV 2.4.4 on a PC with a 2.93 GHz i7 CPU and Windows 7 Professional 64-bit
OS, since OpenCV provides a convenient implementation of most state-of-the-art
features.

Fig. 3. Oxford dataset introduced by Mikolajczyk and Schmid.

Although our descriptor can be computed on keypoints from any detector, we
first carry out the comparison on SURF points. In the setting that considers
scale and rotation correction, we instead employ the AGAST detector [12] used
by the BRISK feature because of its speed. After keypoints are extracted from
each image, 1000 points are selected based on their ranking by the Harris
corner measure [7]. We evaluate the performance of descriptors using the
recognition rate, as BRIEF does. To determine what counts as a correct match,
we choose the Nearest Neighbor Distance Ratio (NNDR) [13]. This strategy is
stricter than the plain Nearest Neighbor one and produces fewer but more
accurate correspondences. The distance ratio is set to 0.9 in all experiments.
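
The NNDR test can be sketched as follows. The paper does not state which distance is used for the integer descriptors, so the Euclidean distance here is an assumption, and the function names are hypothetical. A candidate match is kept only when the nearest neighbour is clearly closer than the second nearest.

```python
def l2(a, b):
    """Euclidean distance between two descriptors (assumed metric)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def nndr_matches(desc_a, desc_b, ratio=0.9):
    """Return (i, j) pairs where desc_b[j] is the nearest neighbour of desc_a[i]
    and the nearest distance is below ratio times the second-nearest distance."""
    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted((l2(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) < 2:
            continue                      # the ratio needs two neighbours
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches

# Toy descriptors: the first two queries have a clear nearest neighbour,
# while the ambiguous query is rejected by the ratio test.
clear = nndr_matches([[0, 0], [5, 5]], [[0, 1], [10, 10], [5, 5]])
ambiguous = nndr_matches([[0, 0]], [[1, 0], [0, 1]])
```

Whether an accepted pair is actually correct still has to be decided against the ground truth of the dataset; the sketch only covers the ratio filter.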





3.2 Parameter Selection 


We first test the impact of the parameters of our methods in order to find a
well-performing combination. Figure 4 plots the effect of varying the patch
size and the number of quantization levels while the other parameter is fixed.
As the patch size increases, the recognition rate of image pair matching
improves markedly. The curves rise steeply at first and approach an upper limit
once the patch size exceeds 16. Beyond that, a larger patch size does not bring
an obvious improvement in performance but requires more computation. The
results for the number of quantization levels are similar. The recognition rate
increases rapidly while the number of levels is below 8, but levels off with
slight fluctuations after that. To sum up, we consider a 16 x 16 patch
quantized to 8 levels a good empirical choice. Therefore, we use these
parameters in the subsequent tests against other descriptors.
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Fig. 4. Impact of parameter variations in quantization-based descriptors. (a) Recog- 
nition rate changes as patch size alters, number of quantization level is fixed to 8. 
(b) Recognition rate changes as number of quantization level alters, a 16 x 16 patch 
size is selected. Experiments are carried out on image pair Wall 1 and 2. 


3.3 Performance Evaluation 


Because our original descriptor does not correct for scale and orientation, we
first carry out a performance evaluation against several comparable description
approaches. Chief among them are the SURF and BRIEF descriptors provided by
OpenCV. The single-scale, orientation-ignoring SURF (SU-SURF, where U stands
for upright) is also employed for a fair comparison. Moreover, we include LIPID
rather than LUCID as a representative of permutation-based description methods,
since LIPID is more robust than LUCID. Note that all methods in this comparison
except SURF ignore scale and orientation, while SURF is included as a reference
for rotation- and scale-invariant features. In addition, SURF keypoints are
used for all descriptors, as explained above, and the Boat sequence, which
tests zoom and rotation, is excluded here. The recognition rates are given in
Fig. 5.

Our methods outperform most of the others on the Wall, Leuven and Ubc images.
For the two image blur sequences, our algorithms work very well under minor
distortions but degrade faster than BRIEF and SURF. SURF features, which are
based on gradients, also suffer considerably from image blurring. As for the
Graffiti sequence, most of the descriptors involved cannot cope with such
extreme affine warping. Q1 and Q2 show clear advantages on textured scenes and
compression artifacts. Q2 performs much better than Q1 under monotonic
illumination changes, probably because the Q2 descriptor inherits the
robustness of order permutation; LIPID also does well on the Leuven sequence
because it is permutation-based too. Overall, our quantization-based
descriptors perform well under many different deformations, and Q2, based on
non-uniform quantization, shows better recognition accuracy than Q1.

Fig. 5. Recognition rates of different description methods

To account for orientation and scale correction, we further compare several
state-of-the-art invariant features in the following experiment. SIFT and SURF
are benchmarks of scale- and rotation-invariant features, while BRISK is a
notable binary feature that handles the invariance problem well. Given the
overall better performance of the CDF-based descriptor in the previous
experiment, only Q2 is tested in this part. All participating features use
their own keypoint detectors, and Q2 employs the multi-scale AGAST detector
that BRISK uses [10] because of its speed. The results for these four features
are shown in Fig. 6. From the rates, we conclude that the quantization-based
descriptor outperforms SURF and BRISK in most cases. SIFT shows better results
than Q2 on the viewpoint change sequence, but poorer ones under light changes
and compression artifacts. For the severely blurred pairs, the multi-scale
AGAST detector has poor repeatability; therefore Q2 performs well on slightly
blurred pairs and degrades rapidly thereafter. As for scale and rotation
changes, the AGAST detector is inferior to the SIFT detector in orientation
consistency, so the quantization-based feature tends to outperform SIFT under
small deformations but yields poorer results as the deformation increases.

Fig. 6. Recognition rates of orientation and scale corrected methods

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Fig. 7. NCC values for patches from Wall image 


We further evaluate the noise resistance of our descriptor. We add Gaussian
noise with zero mean and variance 0.01 to the Wall 1 image, and extract feature
vectors from the original image and the noisy image respectively. Approximately
200 keypoints are detected in the original image, and descriptors of these 200
patches are computed. SURF, BRISK and our method are evaluated, and Normalized
Cross Correlation (NCC) is used to measure the similarity of the two feature
vectors before and after adding noise. As shown in Fig. 7, the NCC value of our
method is higher than that of SURF but lower than that of BRISK, which shows
that SURF is the most sensitive to image noise. As BRISK applies Gaussian
smoothing before building the descriptor (pixel sampling and bit comparison),
it is more robust to Gaussian noise in our evaluation. The real matching
experiment shown in Fig. 8 further illustrates the noise resistance of these
features. The Wall image with different degrees of added noise is matched to
itself, and the recognition rates are computed and plotted as the variance of
the Gaussian noise varies. As the noise increases, BRISK shows the best
resistance, and the performance of SURF degrades most severely. Our method
retains good performance under slight noise and begins to degrade gradually as
the noise keeps increasing.
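
As a sketch of the similarity measure used in this test, the function below computes the NCC of two feature vectors. The text does not spell out the exact NCC variant, so the zero-mean form is an assumption, and the sample vectors are made up for illustration.

```python
def ncc(u, v):
    """Zero-mean normalized cross correlation of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = (sum(a * a for a in du) * sum(b * b for b in dv)) ** 0.5
    return num / den if den else 0.0      # constant vectors carry no signal

# Hypothetical descriptor before and after a small perturbation.
clean = [0, 2, 2, 0, 3, 1, 2, 0]
noisy = [0, 2, 1, 0, 3, 1, 2, 1]
score = ncc(clean, noisy)
```

A value close to 1 means the descriptor changed little under noise; averaging `score` over all keypoints would give one number per method.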

Finally, we analyze the time efficiency of the proposed method in terms of both
time complexity and CPU running time. The uniform quantization can clearly be
computed in linear time, and no extra space is needed. The CDF-based
quantization requires an auxiliary array to store the CDF and is also O(n) in
time complexity. However, it performs two linear-time passes, one for the CDF
and one for the mapping, and hence should be more time-consuming than the
uniform one. Table 1 lists the description and matching time of different
descriptors on the image pair Wall 1-2. For each image, the top 1000 points are
selected from the SURF keypoints according to the Harris measure, and the time
shown in the table is the average computation time per point. For a fair
assessment, only description time is compared here, and all descriptors except
SURF ignore scale and rotation in the description process. The table shows that
the Q1 quantization method is clearly superior in description time. Though Q2
requires more time than Q1, it ranks close to BRIEF. In general, our methods
perform competitively with state-of-the-art features and require very little
computation time.
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Fig. 8. Recognition performance under different degrees of Gaussian noises. 


Table 1. Comparison of computational timings. 


Description time (ms) 0.12 
eo ae 0.12 
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4 Conclusion 


In this paper, we present a simple quantization-based description method for
local image patches. In more detail, we propose two ways to carry out
quantization, based on uniform thresholds on either gray levels or pixel
quantities. In general, descriptors using both quantization methods have very
good time efficiency and perform well on textured scenes, illumination changes,
compression distortions and image blurring. Specifically, the CDF-based
descriptor requires more computation time yet yields better results. We
consider quantization a feasible route to fast descriptor building, alongside
the bit sampling and comparison of BRIEF-like descriptors and the intensity
permutation of LUCID-like descriptors. The quantization-based methods have
advantages similar to those of permutation-based descriptors and consume as
little computation power as binary ones. Our method can be applied as an
alternative to SURF-like and BRIEF-like descriptors under real-time
requirements, limited computation power and minor image deformation. Future
work may aim at improving robustness to greater deformation and warping;
binarized construction methods that use quantization are also worth exploring
for better storage efficiency.
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Abstract. With the development of the telecommunication network, more and 
more devices are used in the network, which has become a burden for network
operation and maintenance. At the same time, network devices generate large
amounts of log data every day, recording the activities of each device in detail.
As a result, the logs reflect the state of the network, and sometimes the
occurrence of a network failure can be predicted from them. However, because the
logs are voluminous, heterogeneous across sources and difficult to understand,
they have not been used effectively to analyze and predict network failures.
Therefore, we propose a method for structuring a large number of device logs in
a short time, and use data generated by a real communication device network to
verify its effect. In addition, we compare our method with traditional log
parsers, such as regular expressions and LogSig, to demonstrate its efficient
processing and accurate pattern extraction for massive network device logs.


Keywords: Big data · Log parser · Telecommunication network equipment ·
Word2vec


1 Introduction 


With the rapid development of telecommunications technology, the
telecommunications network has become more complex and network services more
diverse. At the same time, mining the large amount of data generated by the
network has attracted the attention of many people. Network device logs contain
a lot of information that can represent the operating state and health of the
network. However, because of the volume and characteristics of the log data,
researchers have not yet achieved remarkable results. For example, all the
equipment of one operator in one province can produce about 2 TB of log data in
one day, and these logs are written by seven different vendors in different
formats (Fig. 1).

Obviously, without instructions and guidance from professionals, it is difficult
for an operator to understand the exact meaning of raw log messages produced by
telecommunication network equipment, such as the following example, not to
mention use them to carry out further work.
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Jul 26 18:12:42: {6/LP}: %ASESDK-5-NOTICE: 35208 3 NOTICE 
sgwcd_SEOS _ssc:libsscdoperations.UpdateBearerOperation: , MmeTeid=1073584140, LCOR=0, 
Cause=10 (2)+ 

Jul 26 17:34:41: {8/LP}: %ASESDK-4-WARNING: gtpcd[7909]:gc-0/8/1:8242 <DAPP>: 
<11>: !!!WARN: Gx_WarnUnknownAvpReceived: GX: Received and ignored unknown AVP, Route- 
Record (code 282, vendor 0). (cmdCode=RAR(258), appld=PCC(16777238), 
hopByHopid=212199767, e ...« 


Fig. 1. Typical raw telecommunication device log data 


At present, there are several methods for dealing with logs, and pattern-based
methods are widely used in log analysis. In these methods, a raw log message is
divided into two parts: the constant part and the variable part. For
telecommunication network equipment logs, the variable part contains a lot of
valid information, such as the location of the module that issued the log, the
actions performed by the operator, and the time of the log. However, once the
log volume grows beyond a certain point, the variable data is so large that the
results of this method still take up a lot of storage space.

Therefore, we investigated various log preprocessing methods and, given the
characteristics of network device logs, adopted an approach based on a natural
language model that makes complex logs more suitable for storing and mining. To
validate our method, we compared it with several other typical log parsers on
test sets of different kinds of logs and on a real network device data set,
which demonstrates the accuracy and efficiency of our method.


2 Log Parser 


2.1 Parser Methods 


Three kinds of methods are mainly used for log parsing.


2.1.1 Methods Based on Regular Expression 

In traditional log processing, regular expressions are often used to extract
specific fields. Many programming languages support the use of regular
expressions for string manipulation. This approach produces structured data from
the log, but a large amount of unstructured or semi-structured information is
discarded. It is also not flexible enough, since each expression handles only
the specific logs it was written for.
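
A minimal sketch of regular-expression-based extraction, using the illustrative "down" log line that appears in Fig. 2. The expression and the function name `parse_down_event` are assumptions written for this one line format; they are not taken from any parser evaluated in the paper.

```python
import re

# Matches only lines of the form "<date> <time> <device> down, port <number>".
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<device>\S+) down, port (?P<port>\d+)$"
)

def parse_down_event(line):
    """Return the extracted fields as a dict, or None if the line does not fit."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

fields = parse_down_event("2015-10-10 18:18:18 A down, port 8080")
missed = parse_down_event("2015-10-10 18:18:22 A restart, username =D, password=E")
```

The second call returns `None`, which illustrates the inflexibility mentioned above: every new message format needs its own hand-written expression.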


2.1.2 Methods Based on Pattern 

Log data is generated automatically by programs in the device and is usually
composed of constant strings and variable parameters, so it has obvious
semi-structured features. By generating a set of patterns and comparing messages
against it, the words in the log are divided into log template words and
variable words, so that we can find abnormal parameters in the data set (Fig. 2).


78 L. Li et al. 


Raw log data:

2015-10-10 18:18:18 A down, port 8080
2015-10-10 18:18:19 B down, port 8081
2015-10-10 18:18:19 C down, port 8082
2015-10-10 18:18:22 A restart, username =D, password=E
2015-10-10 18:18:23 B restart, username =F, password=G

Parser

Event list:

2015-10-10 18:18:18 pattern 1
2015-10-10 18:18:19 pattern 1
2015-10-10 18:18:19 pattern 1
2015-10-10 18:18:22 pattern 2
2015-10-10 18:18:23 pattern 2

Pattern list:

pattern 1: * down, port *
pattern 2: * restart, username =*, password=*

Fig. 2. The object of methods based on pattern


For example, the authors of [5] propose the STE method, which judges whether a
word is a log template word or a variable word based on its location and length
in the log text, and thereby determines the log pattern.

This is the usual method for log analysis, and it is comparatively fast and
performs well. However, because device log parameters vary and the data format
is very irregular, the resulting pattern set is of poor quality and highly
redundant, which has a great impact on subsequent data mining work.


2.1.3 Methods Based on Data Mining 
At present, many studies apply data mining algorithms to log processing. For
example, [6] applied the k-center clustering algorithm to ITS system log data in
order to analyze the internal structure of the ITS system. In [7], a
clustering-based log pattern recognition algorithm is designed for a distributed
platform to improve the speed and efficiency of log recognition.

However, at present there is no suitable data mining method for
telecommunication network device log analysis.


2.2 Three Typical Parsers 


In order to verify the performance of our parser, we chose three typical parsers
to compare with ours. The source code of these parsers can be found in [1].


2.2.1 LKE 
This method was proposed to parse free-form text logs for anomaly detection in
distributed systems. It consists of the following steps: (1) remove the
parameters according to established rules; (2) measure raw log similarity by the
weighted edit distance; (3) cluster similar raw log keys together; (4) generate
log templates [2].


2.2.2 IPLoM 

This method was proposed for automatic event log analysis. It consists of a
three-step hierarchical partitioning process followed by template generation:
(1) Partition by log length. The first step uses the token count heuristic to
partition the log messages, because log messages that have the same line format
are likely to have the same token length. (2) Partition by token position. By
counting the words in the same position, the method sorts the logs by the count.
(3) Partition by search for mapping. By searching for mapping relationships
between the sets of unique tokens in two token positions, the logs are divided
into smaller partitions. (4) Log template generation [3].


2.2.3 LogSig 

To understand and optimize system behaviors, LogSig is proposed to generate system 
events from textual log messages. LogSig works in three steps: (1) Word pair generation. 
Raw log data are converted to a set of word pairs to record both the word and its position. 
(2) Log Clustering. Based on the word pairs, a potential value is calculated for each log 
message to decide which cluster the log message potentially belongs to. (3) Log template 
generation. In each cluster, the log messages are leveraged to generate a log template 
[1, 4]. 


2.3 Our Parser 


Telecommunication device log data often have the following characteristics: (1)
Complex parameters. Most of the parameters are numbers. (2) Short text. Most
logs are short sentences or parameter lists, and the sentences are irregular.
The longest sentence is no more than 30 words. (3) Difficult to understand.
Without a professional manual, it is difficult to identify the meaning of a log.

According to these characteristics of the raw data, our parsing method has four
steps. First, remove all punctuation, numbers and words containing numbers.
Second, compute the hash value of the processed text to obtain a unique hash
value for each log text. Third, by comparing the hash values, merge logs with
the same hash into the same log pattern, which yields the log pattern table and
the simplified log event sequence. Fourth, use the edit distance to merge the
log patterns again and rewrite the log event sequence.
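
The four steps can be sketched as below. This is a simplified illustration, not the authors' code: the choice of MD5 as the hash, the token-level edit distance, the threshold `max_dist` and all function names are assumptions, and the edit-distance merge is done on the fly instead of as a separate final pass.

```python
import hashlib
import re

def normalize(line):
    """Step 1: replace punctuation by spaces, then drop tokens containing digits."""
    tokens = re.sub(r"[^\w\s]", " ", line).split()
    return " ".join(t for t in tokens if not any(c.isdigit() for c in t))

def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]

def parse(lines, max_dist=3):
    """Return (pattern texts, event sequence of pattern ids)."""
    by_hash, texts, events = {}, [], []
    for line in lines:
        text = normalize(line)
        key = hashlib.md5(text.encode("utf-8")).hexdigest()     # step 2
        if key not in by_hash:                                  # step 3
            tokens = text.split()
            # Step 4: reuse an existing pattern within max_dist token edits
            pid = next((p for p, t in enumerate(texts)
                        if edit_distance(tokens, t.split()) <= max_dist), None)
            if pid is None:
                pid = len(texts)
                texts.append(text)
            by_hash[key] = pid
        events.append(by_hash[key])
    return texts, events

raw = [
    "2015-10-10 18:18:18 A down, port 8080",
    "2015-10-10 18:18:19 B down, port 8081",
    "2015-10-10 18:18:19 C down, port 8082",
    "2015-10-10 18:18:22 A restart, username =D, password=E",
    "2015-10-10 18:18:23 B restart, username =F, password=G",
]
patterns, events = parse(raw)
```

On the sample lines of Fig. 2 this groups the three "down" messages into one pattern and the two "restart" messages into another. The result depends strongly on `max_dist`, which would have to be tuned on real logs.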


2.4 Parser Practice 


2.4.1 In the Five Kinds of Log Data Set

We use five log data sets that come from different systems (BGL, HPC, HDFS,
Zookeeper and Proxifier); each contains two thousand lines [1]. We ran our
parser and the three other parsers on them and compared the results. All
experiments used the same environment and programming language and gave the
following results.
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From the results we can see that for the small data set of log data, IPLoM algorithm 
shows a great advantage in speed, while our algorithm runs faster than the other
two parsers. In terms of accuracy, our parser has an advantage in the analysis
of certain logs (Figs. 3 and 4).


[Figure: bar chart of running time per parser (visible labels: myparser, IPLoM)
on the Zookeeper, Proxifier, HPC, BGL and SOSP data sets; axis from 0 to 450]

Fig. 3. The running time of four parsers


[Figure: line chart of accuracy for LogSig, LKE, IPLoM and myparser on the
SOSP, BGL, HPC, Proxifier and Zookeeper data sets]

Fig. 4. The accuracy of four parsers
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2.4.2 In the Large-Scale Log Data Set 
We also experimented on an actual data set. Because the actual data set is very
large, we use one thousandth of one day's logs: 750 MB of data containing more
than 7 million log entries (Table 1).


Table 1. The running time of four parsers on big data set

Parser name    Time (s)
LogSig         38581
LKE            -
IPLoM          376
My parser      7673

We list the running time of each parser. LKE was unable to complete parsing
because of memory overflow. IPLoM still had a great advantage in speed, but
without standard regular expressions its results contain a lot of redundancy.
Our parser gives less redundant and more accurate results within a tolerable
time.


2.4.3 Summary 

Comparing our parser with the other three classic parsers, we found that
although it was not as fast as IPLoM, it showed good adaptability and speed when
dealing with a large number of telecommunication device logs. We then carried
out further data mining to understand the data and find the information in the
log data.


3 Log Analysis 


3.1 Background 


3.1.1 Word Vector 

To apply machine learning methods to natural language processing, the most basic
problem is the representation of language symbols. So far, the most commonly
used representation is the one-hot representation, in which each of n words is
an n-dimensional vector that is 1 in one dimension and 0 in all others. However,
this method causes the lexical gap problem: there is no connection between
different words.
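
The lexical gap can be seen directly in a small sketch: with one-hot vectors, the dot product of any two different words is zero, so no pair of words is closer than any other. The vocabulary and function names below are made up for illustration.

```python
def one_hot(vocab):
    """Map each of the n words to an n-dimensional vector with a single 1."""
    n = len(vocab)
    return {w: [1 if i == k else 0 for i in range(n)] for k, w in enumerate(vocab)}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vectors = one_hot(["down", "restart", "login", "logout"])
same = dot(vectors["login"], vectors["login"])        # a word with itself
different = dot(vectors["login"], vectors["logout"])  # two related words
```

Here `different` is zero even though "login" and "logout" are clearly related, which is the gap that low-dimensional real-valued vectors are meant to close.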

Therefore, the distributed representation was proposed, which uses
low-dimensional real-valued vectors to represent the vocabulary. Its biggest
advantage is that words with related meanings are close to each other in
distance. In addition, word vectors exhibit many special properties, as shown
below (Fig. 5).


V(King) - V(Queen) ≈ V(man) - V(woman)


Fig. 5. An application of word vector 
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3.1.2 Neural Network Language Model 

Yoshua Bengio proposed a language model built from a three-layer neural network;
the purpose of the model is to predict the next word w_t from the preceding
n - 1 words. At the bottom are the preceding n - 1 words (w_{t-n+1}, ...,
w_{t-1}), and C(w) denotes the vector of the word w. The input layer of the
network concatenates the n - 1 vectors into an (n - 1) * m dimensional vector,
denoted x, where m is the dimension of a word vector. The second layer of the
network is computed directly as d + Hx, where d is the offset term, whose
initial value is random, with tanh as the activation function (Fig. 6).


i-th output = P(w_t = i | context)
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Fig. 6. Neural network structure proposed by Bengio Yoshua 


The third layer of the network consists of the nodes Y_i and uses the softmax
function to normalize the output values into probabilities. The vector Y fed to
the softmax is calculated as:


Y = b + Wx + U * tanh(d + Hx)


U is the parameter matrix from the hidden layer to the output layer; most of the
model's computation is concentrated in the matrix multiplication between U and
the hidden layer. Finally, we use stochastic gradient descent to optimize the
model and obtain the word vectors [8].
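
The forward pass described above can be sketched in plain Python. The toy sizes, the random initialization and the helper names are assumptions for illustration only; training by stochastic gradient descent is not shown.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(y):
    top = max(y)                                   # subtract max for stability
    exps = [math.exp(v - top) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def nnlm_forward(context, C, H, d, U, W, b):
    """Return P(w_t = i | context) for every word i in the vocabulary."""
    x = [c for w in context for c in C[w]]         # concatenate n-1 word vectors
    hidden = [math.tanh(a + h) for a, h in zip(d, matvec(H, x))]
    y = [bi + wx + uh for bi, wx, uh in zip(b, matvec(W, x), matvec(U, hidden))]
    return softmax(y)

# Toy sizes (hypothetical): vocabulary V, vector size m, n-1 context words, h hidden units.
V, m, n_ctx, h = 6, 4, 2, 5
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

C, H = rand_matrix(V, m), rand_matrix(h, n_ctx * m)
U, W = rand_matrix(V, h), rand_matrix(V, n_ctx * m)
d, b = [rng.uniform(-0.5, 0.5) for _ in range(h)], [0.0] * V

probs = nnlm_forward([3, 1], C, H, d, U, W, b)     # context = word ids 3 and 1
```

The rows of `C` are the word vectors that are learned together with the other parameters; after training they are what is kept as the word representation.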
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3.1.3 Word2vec 

Word2vec is a tool released by Google to compute word vectors, which has gained
wide attention because of its efficiency and convenience. It is based on the
neural network language model. From the relationship between words and
sentences, it computes a vector for each word, and word similarity can be
compared by the distance between the word vectors. Based on the principle of
word2vec, we present a log mining method that can find logs that are literally
similar or logs containing a lot of important links.


3.2 Use Word2vec in the Log Process 


3.2.1 The Method 

We use our parser to parse seven days of telecommunication device log data from
a certain carrier. (1) First, we parse the log data into a log pattern set and a
log event data table. (2) Then, we list the log pattern numbers of each day in
time order to form a word2vec sentence, treating each log pattern as a word.
With the word2vec tool we compute a word vector for each pattern. (3) Finally,
we derive sets of similar patterns by comparing the Euclidean distance between
the word vectors and the similarity between patterns.
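
Step (2) amounts to turning the event table into one token sequence per day. A minimal sketch, assuming each event is a `(timestamp, pattern_id)` pair with timestamps in `YYYY-MM-DD HH:MM:SS` form (the function name and sample data are hypothetical):

```python
from collections import defaultdict

def daily_sentences(events):
    """Group pattern ids by day, in time order, one 'sentence' per day."""
    by_day = defaultdict(list)
    # ISO-style timestamps sort chronologically as plain strings
    for timestamp, pattern_id in sorted(events):
        by_day[timestamp[:10]].append(str(pattern_id))
    return [by_day[day] for day in sorted(by_day)]

events = [
    ("2015-10-11 00:00:01", 2),
    ("2015-10-10 18:18:19", 1),
    ("2015-10-10 18:18:18", 1),
    ("2015-10-10 18:18:22", 2),
]
sentences = daily_sentences(events)
```

The resulting lists of string tokens are in the shape that word2vec implementations typically accept as training sentences.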


3.2.2 The Result 

We tuned the parameters of word2vec for structured log data. With the Skip-Gram
model, a vector dimension of 50 and a window size of 5, we find that the word
vectors are more accurate when looking for similar data patterns. We treat pairs
of vectors whose cosine similarity is greater than 0.9 as similar patterns. In
this way we obtain a number of similar patterns, which are of great significance
for problem analysis (Table 2 and Fig. 7).
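
The selection of similar patterns by cosine similarity can be sketched as follows. The pattern vectors here are made-up two-dimensional values for illustration (the paper uses 50 dimensions), and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_patterns(vectors, threshold=0.9):
    """Return all pattern id pairs whose cosine similarity exceeds threshold."""
    items = sorted(vectors.items())
    return [(p, q) for (p, u), (q, v) in combinations(items, 2)
            if cosine(u, v) > threshold]

# Hypothetical pattern vectors keyed by pattern id.
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
pairs = similar_patterns(vectors)
```

Comparing all pairs is quadratic in the number of patterns, which is acceptable for a pattern table but would not scale to raw log lines.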


Table 2. The parameters of word2vec 


CBOW Iter 
o dso |s |o |1 00-04 |20 jo 100 





The user had a request. UserName IpAddress VpnInstanceName OM_VRF Request gprzfx Result 
The user succeeded in login. UserName gprzfx IpAddress VpnInstanceName OM_VRF 


The user left. UserName gprzfx IpAddress VpnInstanceName OM_VRF Reason user request to leave 


Record display command information. Task Ip VonName User AuthenticationMethod Local-user Command display interface Paif 


Record display command information. Task Ip VonName User AuthenticationMethod Local-user Command display cpu-usage all 
Record display command information. Task Ip VonName User AuthenticationMethod Local-user Command display ip pool 





Fig. 7. The similar log pattern produced by word2vec 


In the results, we find that some similar log patterns produced by word2vec are
literally similar, which means that word2vec can help us optimize the log
parsing and find additional log patterns that should be merged, beyond those
that differ only in numeric parameters or whose edit distance is below a certain
value. At the same time, word2vec
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can help us discover log patterns that are literally unrelated but have similar meanings 
or links, which is of great importance to subsequent log analysis. 


4 Conclusion 


Our algorithm applies the neural network language model to the analysis of
telecommunication equipment logs for the first time and obtains similar
patterns. We also design a log parsing method adapted to telecommunication
equipment logs and verify its effectiveness by experiment.

Through the experiments and the comparison of results, we find that our parser
achieves better parsing results for telecommunication server log data within a
tolerable time. Subsequent analysis, whether it uses word2vec for similar
pattern discovery or other data mining methods such as anomaly detection and
correlation analysis, can build on our processing results.

However, our method also has deficiencies. For example, we can find patterns
that have similar characteristics in their order of occurrence, but how to apply
these patterns to specific telecommunication system problems, such as system
error prediction or automatic log analysis without experts, requires further
research.
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Abstract. Data matching and retrieval aims at finding out similar substrings with 
the pattern P in the given data set T. This problem has wide applications in big 
data analysis. A liberalized verification rule is proposed first, and then a similarity 
computing based order preserving matching method is presented. Theory analysis 
indicates our method runs in linear. Furthermore, the experimental results show 
that our method can improve effectively the precision ratio and the recall ratio. 
More qualified matching results can be detected compared with the state of the 
art of this problem. 


Keywords: Pattern recognition · Order-preserving matching ·
Similarity retrieval


1 Introduction 


Fast and accurate data matching and retrieval is one of the key problems in big data
applications such as video retrieval and stock analysis and prediction. Depending on the
context and application scenario, the data objects can be abstracted into a series of
vectors with different properties. Furthermore, the different properties can be
integrated into a single number through reduction or conversion. Consequently, the data
matching problem is transformed into a string or number matching problem, which is a
well-known kind of problem in pattern recognition. Given a sequence of numbers T of
length n and a pattern P of length m, both being numbers or strings over a finite
alphabet $\Sigma$, the task of string matching is to find all the substrings u in T
which have the same relative order as P and satisfy $|u| = |P|$. For example, let
P = (10, 22, 15, 30, 20, 18, 27) and T = (22, 85, 79, 24, 42, 27, 62, 40, 32, 47, 69,
55, 25); then the relative order of P matches the substring u = (24, 42, 27, 62, 40, 32,
47) of T [1].

Several online solutions [3-7] and one offline solution [2] have been proposed for this
string matching problem. Kim et al. [3] and Kubica et al. [4] presented solutions based
on the Knuth-Morris-Pratt (KMP) algorithm [8]: the KMP algorithm is modified so that it
determines, using the order-borders table, whether the text contains a substring with
the same relative order as the pattern. Kim et al. [3] used the prefix representation to
find the rank of each number in the prefix, and this method was further optimized using
the nearest neighbor representation to overcome the overhead of computing the rank
function. Later, Cho et al. [5, 6] gave a sublinear solution based on the bad character
heuristic of the Boyer-Moore algorithm [9]. At almost the same time, Belazzougui et al.
[7] derived an optimal sublinear solution. The state of the art for this problem is the
filtration method [1] presented by Chhabra et al. All the verification rules in previous
research require the values and the positions of the numbers in T and P to be strictly
coherent, so some actual matching results may be discarded. Moreover, almost all earlier
research focused on time complexity analysis and ignored the analysis of the precision
ratio and the recall ratio. It did not examine whether actual matching results were
being missed as the algorithms became faster and faster.

In view of the above problem, an order-preserving matching method for strings or numbers
based on similarity computing is presented. Building on the preprocessing steps, i.e.
binary transformation and filtration, our method proposes a novel verification rule
which guarantees that more candidate results can be found. Then, a sorting method based
on similarity computing is presented, which ensures that all the matching results are
listed according to their similarity with the pattern P. Theoretical analysis indicates
that our method is sublinear. The experimental results show that our method effectively
improves the precision ratio and the recall ratio compared with the newest method at
present.

This paper is organized as follows. Section 2 presents our solution. Section 2.4 
analyses our approach. The experimental results are given and discussed in Sect. 3. 
Section 4 concludes this paper. 


2 Our Solution 


2.1 Problem Description and Motivation 


The state of the art for order-preserving matching is the filtration method [1]
presented by Chhabra et al., which we call MT.C in this paper. MT.C transforms the
original data T and the pattern P into binary strings T' and P' according to
formulation (1). Searching T for substrings with the same relative order as P is then
transformed into searching for P' in T'. In the above example, P' = 101001 and
T' = 100101001100. Each occurrence is a match candidate, which is verified following the
numerical order of the positions in the original pattern P.


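$$t'_i = \begin{cases} 1, & t_i < t_{i+1} \\ 0, & \text{otherwise} \end{cases} \qquad p'_i = \begin{cases} 1, & p_i < p_{i+1} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$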
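The binary transformation and the filtration step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `binary_transform` and `candidate_offsets` are hypothetical, an increasing adjacent pair is encoded as 1 (which reproduces the strings P' and T' quoted above), and Python's built-in `str.find` stands in for whichever exact string matching algorithm is used as the filter.

```python
def binary_transform(seq):
    """Encode each adjacent pair as '1' if it increases, else '0'."""
    return "".join("1" if a < b else "0" for a, b in zip(seq, seq[1:]))


def candidate_offsets(text, pattern):
    """Offsets in `text` where the encoded pattern occurs in the encoded text.

    str.find is used here as a stand-in for any exact string matcher.
    """
    t_bits = binary_transform(text)
    p_bits = binary_transform(pattern)
    offsets = []
    start = t_bits.find(p_bits)
    while start != -1:
        offsets.append(start)
        start = t_bits.find(p_bits, start + 1)
    return offsets


P = (10, 22, 15, 30, 20, 18, 27)
T = (22, 85, 79, 24, 42, 27, 62, 40, 32, 47, 69, 55, 25)

print(binary_transform(P))      # 101001
print(binary_transform(T))      # 100101001100
print(candidate_offsets(T, P))  # [3]
```

The single candidate at offset 3 corresponds to the substring (24, 42, 27, 62, 40, 32, 47) of T; every candidate still has to pass verification, because different number sequences can share the same binary encoding.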


The MT.C method made the matching process simpler and more efficient than the earlier
solutions. However, it has some deficiencies because its filtration and verification
rules are too harsh. The MT.C method requires the order of the numbers to be strictly
coherent with respect to the values and the corresponding positions in P and T. For
example, as shown in Fig. 1(a), the MT.C method cannot detect the difference between T
and P if the maximum increases or the minimum decreases immensely in T. Moreover, as
shown in Fig. 1(b), the MT.C method cannot detect the difference either if


More Efficient Filtration Method for Big Data 87 


a certain subsection of the data in T jumps suddenly while the overall variation trend
is maintained. Furthermore, to find the matching substrings among the candidate strings
produced by the filtration algorithm, the numbers of P must be sorted, and the
verification process requires the positions and the size relations between T and P to be
strictly coherent. This rule results in the loss of some actual matching substrings. As
shown in Fig. 1(c), according to the MT.C method, the gray node x in T can only vary
between dashed line a and dashed line b, which represent the right neighbor y and the
left neighbor z of x respectively in the sorted order of T. Once the gray node x moves
above dashed line a or below dashed line b, the MT.C method regards T as not matching P,
and the corresponding substring is discarded. In practice this is often too strict,
especially when $|y - z|$ is small enough. For example, when x changes to x', which is
slightly larger than y, or to x'', which is slightly smaller than z, the MT.C method
discards the substring, although we consider the substring still similar to P in most
practical applications such as data retrieval.


[Figure 1 panels (a), (b) and (c); each panel plots Data against Pattern]


Fig. 1. Problem description 


2.2 Our Solution 


The problem of string or number matching is defined in [1] as follows. Two strings
$u = u_1 u_2 \ldots u_m$ and $v = v_1 v_2 \ldots v_m$ of the same length over $\Sigma$
are called order-isomorphic, written $u \approx v$, if formulation (2) holds.


$$u_i \le u_j \Leftrightarrow v_i \le v_j \quad \text{for } 1 \le i, j \le m. \tag{2}$$
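Formulation (2) can be checked directly by comparing every pair of positions. The sketch below is a deliberately naive quadratic check meant only to make the definition concrete; the function name `is_order_isomorphic` is hypothetical and this is not the verification procedure of [1], which works from the sorted order of the pattern.

```python
def is_order_isomorphic(u, v):
    """Return True if u_i <= u_j exactly when v_i <= v_j for all i, j."""
    if len(u) != len(v):
        return False
    m = len(u)
    return all(
        (u[i] <= u[j]) == (v[i] <= v[j])
        for i in range(m)
        for j in range(m)
    )


P = (10, 22, 15, 30, 20, 18, 27)
u = (24, 42, 27, 62, 40, 32, 47)   # the substring of T from the introduction
w = (24, 42, 43, 62, 40, 32, 47)   # third value raised above the second

print(is_order_isomorphic(P, u))  # True
print(is_order_isomorphic(P, w))  # False
```

The second call fails because a single value crosses its neighbor in the sorted order, which is exactly the strictness that the following discussion sets out to relax.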


This rule demands that all numbers in u and v are strictly coherent in both their sizes
and their positions, so some actual matching results would be discarded, as shown in
Fig. 1(c). To overcome this deficiency, a different rule is presented as follows.
Consider two strings $u = u_1 u_2 \ldots u_m$ and $v = v_1 v_2 \ldots v_m$ of the same
length over $\Sigma$. The numbers of v are first sorted, and the result is a sequential
table r as in formulation (3).


lv <v, , l<i<j<m} (3) 


ra y 
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an (4) 


Two strings u and v are called order-isomorphic, written $u \approx v$, if formulation
(4) holds. Evidently, multiple matching results may be found according to formulation
(4) once the pattern is given, but these results differ in how similar they are to the
pattern. To distinguish the similarity levels of the candidate results, a similarity
computing method is needed on top of the optimized verification method. It takes into
account the possible fluctuation range of the numbers in T that have the same positions
as the corresponding numbers in P. The similarity function f between the candidate
substring u and the pattern P, illustrated in Fig. 2, is defined as formulation (5).


$$f(u, P) = \sum_{i=1}^{m} \left| g(u_i) - g(p_i) \right| \tag{5}$$













Fig. 2. Data reduction 


where $u_i$ and $p_i$ are the corresponding numbers at the same position in the
candidate substring u and the pattern P, and m is the length of u and P. g is the
reduction function shown in formulation (6), which maps all the numbers in u and P into
the same range [0, 1]. Min(u) and Min(P) are the minimum values in u and P respectively,
and Max(u) and Max(P) are the maximum values in u and P respectively.


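$$g(u_i) = \frac{u_i - \mathrm{Min}(u)}{\mathrm{Max}(u) - \mathrm{Min}(u)} \quad \text{or} \quad g(p_i) = \frac{p_i - \mathrm{Min}(P)}{\mathrm{Max}(P) - \mathrm{Min}(P)} \tag{6}$$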


As shown in Fig. 2, the similarity between the candidate substring u and the pattern P
can be computed easily as the sum of all the differences Diff.i ($1 \le i \le m$)
between the corresponding nodes of u and P according to formulation (5).


2.3 The Proposed Algorithm 


As described above, our solution includes four steps: binary transformation, filtration,
verification and similarity computing. The binary transformation can be performed
according to [1], and the filtration can be conducted using any exact string matching
algorithm. Assuming that the binary transformation and filtration have been finished,
the pseudo-code of our solution, based on the optimized verification and the similarity
computing, is given as Algorithm 1 and Algorithm 2.





Algorithm 1: Computing the similarity

1) X1 = AnticipationBinary(Ti);
2) X2 = AnticipationBinary(P);
3) For i = 0 to length - 1
4)     Sum += Abs(X1[i] - X2[i]);
5) Return Sum.





Algorithm 2: AnticipationBinary (the reduction function)

1) Input A;
2) Max = A.Max;
3) Min = A.Min;
4) For i = 0 to length - 1
5)     If (Max == Min)
6)         B[i] = 0;
7)     Else
8)         B[i] = (A[i] - Min)/(Max - Min);
9) Return B.
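Algorithms 1 and 2 translate almost line by line into runnable code. The following Python sketch is one possible rendering, not the authors' implementation (which was written in C#); the names `normalize` and `similarity` are hypothetical, and returning 0 for a constant sequence is an assumption made here to avoid division by zero.

```python
def normalize(values):
    """Reduction function g: map the values into [0, 1] by min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant sequence is mapped to all zeros.
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def similarity(u, p):
    """Sum of absolute differences between the normalized sequences.

    Smaller values mean that u is more similar to p; 0 means identical shape.
    """
    if len(u) != len(p):
        raise ValueError("candidate and pattern must have the same length")
    return sum(abs(a - b) for a, b in zip(normalize(u), normalize(p)))


P = (10, 22, 15, 30, 20, 18, 27)
u = (24, 42, 27, 62, 40, 32, 47)

print(similarity(P, P))  # 0.0
print(similarity(u, P))  # a positive value: same order, different shape
```

Sorting the verified candidates by this value in ascending order gives the ranked result list described above.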


2.4 Algorithm Analysis


Our solution, denoted MS, includes four steps: Binary Transformation (BT), filtration,
verification and Similarity Computing (SC). So, the time complexity of our solution,
O(MS), can be expressed as follows.


O(MS) = O(BT) + O(filtration) + O(verification) + O(SC)


Compared with [1], the main differences of our solution lie in the verification and the
similarity computing. The main modification to the verification is that a liberalized
verification rule is applied so that more matching results can be detected. The time
complexity of verification is therefore unchanged, and only O(SC) needs to be analyzed.

Suppose that the numbers in P and T are integers, that they are statistically
independent of each other, and that their distribution is discrete uniform. Let
$L_T = n$ and $L_P = m$. In the worst case, the similarity computing process requires
O(m(n-m)) similarity computing operations. In most cases, $L_T$ is very large while
$L_P$ is a sufficiently small constant, so O(m(n-m)) ≈ O(nm) on average, which is equal
to that of the MT.C method. So we can conclude that O(MS) = O(MT.C), both of which are
sublinear.


3 Experimental Results 


Our experiments used the linear-time string matching algorithm KMP [8] as the filtration
method. The tests were run on a single node of the Tianhe-2 supercomputer (CPU: E5-2692
v2, 12 cores × 2, 2.20 GHz). All the algorithms were implemented in C# in 64-bit mode.


3.1 Effectiveness 


To illustrate the advantage of our method, two special data sets were generated from one
basic data set. The basic data set was T = (10, 14, 12, 15, 13, 28, 36, 32, 24, 38, 26)
and P = (10, 14, 12, 15, 13, 19, 21, 20, 17, 22, 18). The first special data set, shown
in Fig. 3(a), was generated by multiplying the last six numbers in T by n (1 < n < 100).
The second special data set, shown in Fig. 3(b), was generated by multiplying only the
10th number in T by n (1 < n < 100).


Fig. 3. Data set samples


From Fig. 3(a) and (b), we can see that the data T and the pattern P become more and
more dissimilar as n increases, but the MT.C method cannot detect the differences
between them. By computing the similarity between T and P according to the techniques
presented in Sect. 2, we obtain the trend of the similarity between T and P with
increasing n shown in Fig. 4. According to the definition of similarity in formulation
(5), the larger the value of the similarity, the more dissimilar the substring is to P.
We can see that the similarity value increases with n and converges as n grows without
bound. The point of inflection marks where the data set T can no longer be considered
similar to P according to our method. At the same time, the points of inflection differ
between data sets, so our method has no fixed point of inflection: where it appears
depends on the data set itself.
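The first special data set is easy to reproduce. The sketch below, with hypothetical helper names and the same assumed min-max normalization as formulation (6), scales the last six numbers of T by n and records both the binary encoding seen by the filter and the similarity value; n = 1 is included only to show the unscaled basic data set.

```python
def binary_transform(seq):
    """Encode each adjacent pair as '1' if it increases, else '0'."""
    return "".join("1" if a < b else "0" for a, b in zip(seq, seq[1:]))


def normalize(values):
    """Min-max scaling into [0, 1]; a constant sequence maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def similarity(u, p):
    """Sum of absolute differences between the normalized sequences."""
    return sum(abs(a - b) for a, b in zip(normalize(u), normalize(p)))


def scale_tail(data, n, count=6):
    """Multiply the last `count` numbers of `data` by n."""
    head = len(data) - count
    return tuple(x if i < head else x * n for i, x in enumerate(data))


T = (10, 14, 12, 15, 13, 28, 36, 32, 24, 38, 26)
P = (10, 14, 12, 15, 13, 19, 21, 20, 17, 22, 18)

encodings = set()
trend = []
for n in range(1, 100):
    t_n = scale_tail(T, n)
    encodings.add(binary_transform(t_n))
    trend.append(similarity(t_n, P))

print(encodings == {binary_transform(P)})  # True: the filter sees no change
print(trend[0] < trend[-1])                # True: the similarity value grows
```

The binary encoding stays identical to that of P for every n, while the similarity value rises, which is the behaviour the discussion above attributes to the two methods.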


Fig. 4. Similarity convergence

Imitating the data generation method in [1], we generated two special data sets and one
random data set. The random data set contains 100000000 random integers between 0 and
$2^{30}$. The pattern lengths (LP) were chosen as 10, 15 and 30. The experimental
results showed that our method can find more matching substrings than the MT.C method.
For example, when the pattern length was 10, the MT.C method found only 15 results,
whereas our method found 135 results, which were sorted according to their similarities.
The 10 most similar results for pattern lengths 10, 15 and 30 are shown in Figs. 5, 6
and 7 respectively. The red line represents the pattern, the blue lines represent the
results that are missed by the MT.C method but detected by our solution, and the black
lines represent the matching results that are detected by both the MT.C method and our
solution.




























[Figure legend: pattern; matching results labelled by similarity, e.g. 0.02231 and 0.02566]

Fig. 6. The most similar 10 results (Lp = 15)

Fig. 7. The most similar 10 results (Lp = 30)


3.2 Time Consumption


Furthermore, to compare the time consumption of our method and the MT.C method, we
enlarged the length of T gradually from 10000000 to 49000000 in increments of 1000000.
Ten experiments were conducted, and the average time consumption was computed for
different pattern lengths such as 5, 30 and 50. The statistical results are shown in
Fig. 8. Because our method finds more results than the MT.C method, more similarity
computing operations are needed, so the overall execution time shows a small increase
under equal conditions. However, this increase in execution time shrinks quickly as the
pattern length grows, because the number of matching results decreases quickly with
increasing pattern length.

Fig. 8. Comparison of average time consuming


4 Conclusion


Fast and accurate data matching and retrieval is a key problem in big data applications
such as video retrieval, stock analysis and bioinformatics. An order-preserving matching
method for strings or numbers based on similarity computing is presented. Theoretical
analysis indicates that our method is sublinear. Furthermore, the experimental results
show that our method effectively improves the precision ratio and the recall ratio
compared with the newest method at present under the same conditions. Compared with
former research on the order-preserving matching problem, our solution liberalizes the
verification rules to a certain extent, so more qualified matching results can be
detected.


Abstract. Blind image quality assessment (BIQA) assesses the perceptual quality of a
distorted image without any information about its original reference image. Features
consistent with the human visual system (HVS) have been proved effective for BIQA.
Motivated by this, we propose a novel general-purpose BIQA approach. Firstly,
considering that the HVS is sensitive to image texture and edges, the image gradient and
wavelet decomposition are computed. Secondly, taking the direction sensitivity of the
HVS into account, the gray level co-occurrence matrices (GLCMs) are calculated in two
directions at four scales on the computed feature maps, i.e., the gradient and wavelet
decomposition maps, as well as on the image itself. Then, four features are extracted
from each GLCM. Finally, a regression model is established to map image features to
subjective opinion scores. Extensive experiments are conducted on the LIVE II, TID2013
and CSIQ databases, and show that the proposed method is superior to the
state-of-the-art BIQA methods and comparable to SSIM and PSNR.


Keywords: Blind image quality assessment (BIQA) ·
Gray level co-occurrence matrix (GLCM) · Human visual system (HVS) ·
Image structure


1 Introduction 


At present, digital images, as carriers of massive information, have greatly enriched
people's lives and drastically facilitated communication among people [1, 2]. Yet image
distortion remains a stubborn problem in image transmission systems. Therefore, it is
indispensable to establish efficient methods for image quality assessment (IQA).

Generally, IQA methods can be split into two major categories: subjective and objective
assessment methods. Currently, objective IQA algorithms are widely studied because they
are easy to implement and portable. According to the available information about the
pristine image, objective assessment methods can be further classified into
full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA) and no-reference IQA
(NR-IQA). Since both FR-IQA and RR-IQA methods use information about the original
reference image, they are limited to special situations. In this paper, we mainly focus
on NR-IQA methods.

At present, NR-IQA methods can be broadly divided into two classes, i.e., training-based
opinion-aware metrics and opinion-unaware metrics. The former require a training process
to create a regression model for predicting image quality. For example, Moorthy and
Bovik [3] proposed a two-step framework called BIQI. Specifically, a regression model
was trained for each distortion type, so that the distortion type of an image can be
obtained through these models. Subsequently, image statistical properties were gradually
applied to IQA and have proved effective. For instance, Saad et al. [4] provided an
NR-IQA algorithm under the hypothesis that the statistical features of discrete cosine
transform (DCT) coefficients change regularly with image quality. Although these methods
have achieved meaningful performance, they require a training procedure. To tackle this
problem, metrics which require neither human opinion scores nor any regression model
have been proposed. Xue et al. [5] used a set of cluster centroids with quality labels
as a codebook to predict image quality, called QAC. The natural image quality evaluator
(NIQE) [6] established a completely blind BIQA metric by fitting quality-aware features
to a multivariate Gaussian (MVG) model. Although no training process is required, their
performance needs to be further improved. In this paper, we propose a new blind image
quality assessment method based on training.

It should be mentioned that the above methods mainly rely on mathematical statistics
without full consideration of the HVS characteristics. The GLCM can effectively describe
image features by measuring statistical characteristics of the image in multiple
directions and at multiple scales. By considering the characteristics of the HVS and the
variety of ways to compute the GLCM, this paper presents a simple yet effective BIQA
metric. Figure 1 shows the pipeline of our method. It can be divided into the following
three parts: calculation of GLCM, feature extraction and image quality prediction.


[Pipeline diagram: distorted image, gradient image and wavelet decomposition → calculation
of GLCM → feature extraction (features of GLCM) → image quality prediction (SVR) →
quality score]





Fig. 1. The pipeline of the proposed method 


2 The Proposed Method 


2.1 Feature Map Extraction 


(1) Gradient map and wavelet transform 


Given a color image, it is first transformed into grayscale, denoted by I(x, y). The
direction templates in the horizontal and vertical directions are denoted by $T_x$ and
$T_y$:


$$T_x = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} \tag{1}$$


Blind Image Quality Assessment via Analysis of GLCM 97 


$$T_y = T_x^{T} \tag{2}$$
where 'T' denotes transpose.

Then, the gradient components in the horizontal and vertical directions, denoted by
$G_x$ and $G_y$, are computed as:


$$G_x = T_x * I \tag{3}$$
$$G_y = T_y * I \tag{4}$$
where '*' denotes convolution. Finally, the gradient map G is calculated as:


$$G = \frac{|G_x| + |G_y|}{2} \tag{5}$$


The wavelet transform decomposes an image at multiple scales and in multiple directions.
The image is usually transformed along the horizontal, vertical and diagonal directions,
and the decomposition sub-images in those three directions are usually denoted by HL,
LH and HH, respectively [15]. In this paper, the wavelet decomposition scale is set to
1, which gives good results.

As image distortion always induces structural degradation, we evaluate image quality by
using image structure information. The image gradient and the decomposition sub-images
are complementary in representing rich image structure. On the one hand, the image
gradient describes the global image structure but misses orientation information. On the
other hand, the wavelet decomposition reflects image features in different orientations
but ignores global structure. Hence, their combination ensures the integrity of the
image structure information.


(2) GLCM matrix calculation 


Usually, image distortion brings about a significant change in image statistical
characteristics. The GLCM provides image statistical characteristics in different
directions and at different scales in the spatial domain, so it can describe image
characteristics from various aspects. Based on this, the GLCMs of the above image
structure maps are calculated in this paper.

The GLCM is composed of the joint probability density between image gray tones. There
are three important parameters in the GLCM: the angle ($\theta$), the number of
quantized gray tones (L) and the distance (d). Firstly, the image is quantized to L gray
tones. Then, the probability of occurrence of the pair of gray tones i and j in the
original image is expressed as $P(i, j, d, \theta)$ (i = 1, 2, ..., L; j = 1, 2, ..., L),
where the two pixels of each pair (i, j) are separated by a distance d in the direction
of angle $\theta$. Finally, the GLCM can be denoted as $[P(i, j, d, \theta)]_{L \times L}$,
where $P(i, j, d, \theta)$ is the element in the i-th row and j-th column.
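To make the three parameters concrete, the following Python sketch builds a normalized GLCM for a small image. It is an illustration only: the function names are hypothetical, gray tones are indexed from 0 rather than 1, pairs are counted in one direction only (a non-symmetric matrix), and the 0 and 90 degree offsets follow the common convention of "pixel to the right" and "pixel above". The paper does not state which of these conventions its implementation uses.

```python
def quantize(image, levels, max_value=255):
    """Map integer pixel values in [0, max_value] to gray tones 0..levels-1."""
    return [
        [min(levels - 1, pixel * levels // (max_value + 1)) for pixel in row]
        for row in image
    ]


def glcm(image, levels, distance, angle):
    """Normalized co-occurrence matrix of an already quantized image.

    angle 0 pairs each pixel with the one `distance` columns to its right;
    angle 90 pairs it with the one `distance` rows above (assumed convention).
    """
    offsets = {0: (0, distance), 90: (-distance, 0)}
    dr, dc = offsets[angle]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("distance exceeds the image size")
    return [[n / total for n in row] for row in counts]


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4, distance=1, angle=0)
for row in p:
    print([round(v, 3) for v in row])
```

Each entry is the fraction of horizontal neighbor pairs with gray tones (i, j); the entries of the matrix sum to 1.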


2.2 Feature Extraction 


In [7], fourteen features were extracted from the GLCM to represent image properties
from multiple perspectives. Currently, researchers usually use only part of them in view
of the redundancy among them [8]. In this paper, we employ four commonly used features,
namely contrast, energy, correlation, and homogeneity, to extract quality-sensitive
features for IQA. These four features cover local and global image characteristics.
Among them, contrast and energy describe the overall characteristics of the image:
contrast describes image definition, and energy reflects the image distribution as well
as roughness. In contrast, correlation and homogeneity are local image descriptors:
correlation captures the local correlation of the image grayscale, and homogeneity
measures the local change of the image grayscale. Overall, the selected features reflect
both local and global features of the image to a certain extent, so they can be applied
to the IQA problem.
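The text names the four features without giving formulas. The sketch below uses the widely used textbook definitions, which is an assumption: contrast as the sum of (i - j)^2 p(i, j), energy as the sum of p(i, j)^2, homogeneity as the sum of p(i, j) / (1 + |i - j|), and correlation as the normalized covariance of the row and column indices. The function name `glcm_features` and the choice to report 0 for an undefined correlation are likewise illustrative.

```python
import math


def glcm_features(p):
    """Contrast, energy, correlation and homogeneity of a normalized GLCM."""
    levels = len(p)
    cells = [(i, j, p[i][j]) for i in range(levels) for j in range(levels)]

    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)

    mean_i = sum(i * v for i, _, v in cells)
    mean_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mean_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mean_j) ** 2 * v for _, j, v in cells)
    if var_i == 0 or var_j == 0:
        # Assumption: correlation is undefined for a constant image; report 0.
        correlation = 0.0
    else:
        cov = sum((i - mean_i) * (j - mean_j) * v for i, j, v in cells)
        correlation = cov / math.sqrt(var_i * var_j)

    return {
        "contrast": contrast,
        "energy": energy,
        "correlation": correlation,
        "homogeneity": homogeneity,
    }


smooth = [[0.5, 0.0], [0.0, 0.5]]    # neighbors always share the same tone
checker = [[0.0, 0.5], [0.5, 0.0]]   # neighbors always differ
print(glcm_features(smooth))
print(glcm_features(checker))
```

For the smooth matrix the contrast is 0 and the homogeneity is 1, while the checkerboard-like matrix has contrast 1 and correlation -1, which matches the intuition behind the four descriptors.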

Although we have demonstrated the feasibility of the GLCM for the IQA problem, how to
choose the parameters, i.e., $\theta$, L and d, is still a thorny problem. Research
shows that the HVS is more sensitive to horizontal and vertical image information than
to oblique directions [9]. Moreover, different viewing distances produce different
perceptions for the HVS: it focuses on the outline of the image at a large viewing
distance, while at a small distance it pays attention to image details [10].
Correspondingly, a small scale in the GLCM describes the characteristics of fine image
structure, while a large scale captures the characteristics of coarse image structure.
Inspired by these observations, we extract GLCMs in multiple directions and at multiple
scales. Specifically, $\theta$ is set to 0° and 90° to highlight the sensitive
directions of the HVS, the distance d is set to 1, 2, 4, and 8 to simulate the variation
of viewing distance, and L is set to 8. Since distortion also corrupts the brightness
information, we also extract the above features from the distorted image itself to avoid
losing it. Overall, the GLCM for the gradient image, the decomposed high-frequency
sub-images (HL, LH and HH after one-scale wavelet decomposition) and the distorted image
is calculated in two directions (0° and 90°) at four scales, resulting in eight GLCMs
for each image. A total of 40 GLCMs are obtained, and four features are extracted from
each of them.
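The bookkeeping in this paragraph can be checked by enumerating the settings. In the short Python sketch below the map labels and the naming scheme of the features are hypothetical; only the counts come from the text, and the length of the final feature vector (40 × 4 = 160) is simple arithmetic on those counts.

```python
from itertools import product

MAPS = ("gradient", "HL", "LH", "HH", "distorted")   # five input maps
ANGLES = (0, 90)                                      # degrees
DISTANCES = (1, 2, 4, 8)                              # GLCM scales d
FEATURES = ("contrast", "energy", "correlation", "homogeneity")

# One GLCM per (map, angle, distance) combination.
glcm_settings = list(product(MAPS, ANGLES, DISTANCES))

# One scalar per (GLCM, feature) combination; names are illustrative only.
feature_names = [
    f"{name}_a{angle}_d{distance}_{feature}"
    for (name, angle, distance) in glcm_settings
    for feature in FEATURES
]

print(len(glcm_settings))   # 40
print(len(feature_names))   # 160
```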


2.3 Image Quality Assessment 


After feature extraction, image quality assessment is realized with a regression model.
Specifically, the training samples are denoted as
$T = \{(F_1, D_1), (F_2, D_2), \ldots, (F_i, D_i), \ldots, (F_m, D_m)\}$, where i is the
index of the training images, $F_i \in R^n$ represents the image feature vector and
$D_i$ denotes the image opinion score. The set T is used to train a model, and the
obtained regression model can then be used to predict image quality. Its mapping
function can be abbreviated as $D_t = \mathrm{model}(F_t)$, where $F_t$ is the feature
vector of the test image and $D_t$ is the predicted quality score. In our metric, we
employ support vector regression (SVR) to evaluate image quality. The LIBSVM toolbox is
used to implement epsilon-SVR with a radial basis function kernel [11].




3 Experiment Results and Analysis 


3.1 Experiment Setup 


The proposed method is tested on three public databases: LIVE II [12], TID2013 [13] and
CSIQ [14]. On the LIVE II database, we test the proposed algorithm on all five
distortion types, i.e., JPEG2000 compression (JP2K), JPEG compression (JPEG), white
noise (WN), Gaussian blur (Gblur) and transmission errors in JP2K using the fast-fading
Rayleigh channel model (FF). On the TID2013 and CSIQ databases, four distortion types
are tested, namely JP2K, JPEG, WN and Gblur. Three general IQA criteria, i.e., the
Spearman rank order correlation coefficient (SROCC), the Pearson linear correlation
coefficient (PLCC) and the root-mean-squared error (RMSE), are employed for performance
evaluation. Better performance means a value close to 1 for PLCC and SROCC and a value
close to 0 for RMSE.
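For reference, the three criteria can be computed with the standard library alone. The sketch below uses the textbook definitions with hypothetical function names; it applies the measures directly to the raw predictions and omits any nonlinear mapping between predicted and subjective scores, since the text does not say whether one is used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation coefficient (SROCC)."""
    return pearson(ranks(x), ranks(y))


def rmse(predicted, target):
    """Root-mean-squared error between predictions and opinion scores."""
    n = len(predicted)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, target)) / n)


predicted = [1.0, 2.0, 3.0, 4.0]
opinion = [1.0, 4.0, 9.0, 16.0]    # monotone but nonlinear in the predictions
print(spearman(predicted, opinion))  # rank agreement is perfect
print(pearson(predicted, opinion))   # linear agreement is slightly lower
print(rmse(predicted, opinion))
```

The toy data show why both correlations are reported: SROCC measures only monotonic agreement, while PLCC and RMSE also penalize a nonlinear relationship.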

In order to verify the effectiveness of the proposed method, we select two public FR
algorithms (SSIM [15] and PSNR) and several mainstream NR methods (QAC [5], BIQI [3],
ILNIQE [14], GM-LOG [16] and YCLT-YCbCr [17]) for comparison. For ROI-BRISQUE, because
the source code could not be obtained, we directly use the experimental results on the
LIVE II database reported in the original paper. Since the proposed method is based on
training, we divide the image set into two non-overlapping sets: a training set and a
testing set. The training set contains 80% of the reference images and their
corresponding distorted versions, and the testing set consists of the remaining images.
The random train-test split is repeated 1000 times, and the median performance is taken
as the final result.
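The evaluation protocol splits by reference image, so that no reference content appears in both sets. A minimal Python sketch of that protocol follows; the function names, the dictionary layout mapping each distorted image to its reference, and the `evaluate` callback (which would train the regressor and return, for example, the SROCC on the test set) are all hypothetical.

```python
import random
from statistics import median


def split_by_reference(image_refs, train_fraction=0.8, rng=random):
    """Split distorted images so that train and test share no reference image.

    `image_refs` maps each distorted image id to its reference image id.
    """
    references = sorted(set(image_refs.values()))
    rng.shuffle(references)
    cut = round(train_fraction * len(references))
    train_refs = set(references[:cut])
    train = [img for img, ref in image_refs.items() if ref in train_refs]
    test = [img for img, ref in image_refs.items() if ref not in train_refs]
    return train, test


def median_over_trials(image_refs, evaluate, trials=1000, seed=0):
    """Repeat the random split and return the median of the evaluation score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        train, test = split_by_reference(image_refs, rng=rng)
        scores.append(evaluate(train, test))
    return median(scores)


# Toy database: 10 reference images with 3 distorted versions each.
image_refs = {f"ref{r}_d{k}": f"ref{r}" for r in range(10) for k in range(3)}
train, test = split_by_reference(image_refs, rng=random.Random(1))
print(len(train), len(test))  # 24 6
```

Taking the median rather than the mean over the repeated splits makes the reported figure less sensitive to an occasional unlucky split.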


3.2 Experiment Results 


Table 1 shows the performance on each entire database. For easier comparison, the top
three algorithms are highlighted in bold. As we can see, the performance of the proposed
method always lies in the top three. In fact, compared with the IQA methods in Table 1,
our method achieves the best performance on all three databases.


Table 1. SROCC, PLCC and RMSE (median value across 1000 train-test trials) of SSIM, PSNR, 
QAC, BIQI, NIQE, ILNIQE, GM-LOG and YCLT-YCbCr on the overall database of LIVE II, 
TID2013 and CSIQ respectively. 


Database | Metric | SSIM | PSNR | QAC | BIQI | IL-NIQE | GM-LOG | YCLT-YCbCr | Pro.


LIVE II 0.9581
0.9348 | 0.9524 
6.6445 
CSIQ 0.8980 0.9482 
0.8869 | 0.9432 
0.0890 
TID2013 0.8789 0.9512 
0.8690 | 0.9377 
0.4293


4 Conclusion


In this paper, we propose a simple yet efficient blind IQA metric based on a GLCM
statistical model of image structure. To verify the performance of the proposed method,
we conducted a set of experiments on the LIVE II, CSIQ and TID2013 databases. We apply
it to each entire database, and the experimental results demonstrate that our predicted
scores are more accurate than those of two public FR-IQA algorithms and the mainstream
NR-IQA methods. In summary, we can conclude that the proposed method achieves excellent
performance in BIQA.


Abstract. In this paper, an improved ranked K-medoids algorithm realized by a specific
cell-like P system is proposed, which extends the application of membrane computing.
First, we use the maximum distance method to choose the initial clustering medoids; this
method is based on the fact that initial medoids that are farthest from each other are
the least likely to be assigned to the same cluster. Then, we realize this algorithm
with a specific P system. A P system is well suited to clustering problems because of
its high parallelism and lower computational time complexity. Through the computation of
the designed system, one possible clustering result is obtained in a non-deterministic
and maximally parallel way. Verification on an example shows that our algorithm can
improve the quality of clustering.


Keywords: Ranked K-medoids - Maximum distance - P system clustering 


1 Introduction 


Clustering is a rapidly developing area which contributes to research fields including
data mining, machine learning, spatial database technology, biology and marketing [1].
Clustering analysis is the process of dividing a set of objects into non-overlapping
subsets [1].

Up to now, many kinds of clustering approaches have appeared, for instance 
hierarchical [2], partitioning [3], density-based [4], model-based [5] and grid-based [6]. 
As a partitioning clustering algorithm, the ranked K-medoids algorithm has strong 
robustness and high accuracy in contrast to the traditional K-medoids algorithm [2]. In this 
algorithm the K medoids are selected randomly, so two or more medoids can easily be 
assigned to one cluster. When this happens, one of them is kept and the others have to 
be relocated. 

As a new branch of natural computing, membrane computing is a cross-disciplinary 
topic incorporating computer science, biology, artificial intelligence and so on. It has 
the advantage of parallelism, so it can lessen the time complexity and improve the 
processing speed for massive data sets [8, 9]. 

In this paper, a new version of the ranked K-medoids algorithm is proposed in order to 
escape from local optima. To select the initial medoids we use the 
“Maximum Distance Method”, which yields accurate medoids at 
the beginning. To realize this algorithm we designed a specific cell-like P system. 

The rest of the paper is organized as follows. Section 2 presents a variant of the 
ranked K-medoids algorithm. Section 3 designs a specific P system to realize the 
proposed algorithm. Finally, Sect. 4 draws conclusions. 


2 The Improved Ranked K-medoids Clustering Based 
on Maximum Distance Method 


2.1 Maximum Distance Method 


In ranked K-medoids clustering the initial medoids are selected randomly. When more 
than one medoid is assigned to the same cluster, one of them is kept and the others are 
relocated, which wastes a lot of time. To solve this problem we use the maximum 
distance method. 

The maximum distance method is based on the following fact: objects that are far from 
each other are less likely to be assigned to the same cluster. Based on this fact, we first 
calculate the distance between every two objects, and the two objects with the 
maximum distance are chosen as the initial cluster centers. Among the remaining (N-2) 
sample points, the object for which the product of the distances to the first two 
initial centers is maximum is selected as the third cluster center. The fourth cluster 
center is selected in the same way, and so on, until the k initial cluster centers are found. 
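
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is assumed as the distance measure, the rule for the fourth and later centers is assumed to be the product of the distances to all centers chosen so far, and the function name `max_distance_init` and the sample points are hypothetical (the data set of the paper's example is not listed in the text).

```python
import math
from itertools import combinations


def max_distance_init(points, k):
    """Pick k initial medoid indices with the maximum distance method."""
    if not 2 <= k <= len(points):
        raise ValueError("k must be between 2 and the number of points")
    # The two objects that are farthest apart become the first two centers.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    centers = [first, second]
    # Each further center maximizes the product of its distances to all
    # centers chosen so far (assumed generalization beyond the third center).
    while len(centers) < k:
        remaining = [i for i in range(len(points)) if i not in centers]
        best = max(
            remaining,
            key=lambda i: math.prod(math.dist(points[i], points[c]) for c in centers),
        )
        centers.append(best)
    return centers


# Hypothetical sample data, not the data set used in the paper.
points = [(1, 1), (2, 1), (1, 2), (5, 7), (5, 6), (8, 2), (7, 2)]
initial = [points[i] for i in max_distance_init(points, 3)]
print(initial)
```

The pairwise search costs O(N^2) distance evaluations, which is the same order as the similarity computation that ranked K-medoids needs anyway.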


2.2 Ranked K-medoids Based on Maximum Distance Method 


In this paper, a new ranked K-medoids clustering algorithm based on the maximum 
distance method is proposed: 


1. Calculate the similarities among pairs of objects based on the similarity metric. 

2. Calculate the R matrix by sorting the similarity values, and store the indexes of similar 
objects, from the most similar to the least similar, in the sorted index matrix. 

3. Select k medoids using the maximum distance method. 

4. Select the group of the most similar objects to each medoid, using the sorted index 
matrix (the number of members of the group is determined by an input 
parameter m). 

5. Calculate the hostility values of every object in those groups. 

6. Choose the object with the maximum hostility value as the new medoid. 

7. Assign each object to the most similar medoid. 
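
Steps 1, 2, 4 and 7 of the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: similarity is assumed to be inverse to Euclidean distance, the helper names and sample points are hypothetical, and the hostility computation of steps 5 and 6 is left out because its formula is not given in this section.

```python
import math


def sorted_index_matrix(points):
    """Steps 1-2: row i lists all object indices from most to least similar to i."""
    n = len(points)
    return [
        sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]


def group_of(medoid, sorted_index, m):
    """Step 4: the m objects most similar to the medoid (normally led by the medoid itself)."""
    return sorted_index[medoid][:m]


def assign(points, medoids):
    """Step 7: label each object with the index of its most similar medoid."""
    return [min(medoids, key=lambda c: math.dist(p, points[c])) for p in points]


# Hypothetical sample data and medoid indices.
points = [(1, 1), (2, 1), (1, 2), (5, 7), (5, 6), (8, 2), (7, 2)]
medoids = [0, 3, 5]
R = sorted_index_matrix(points)
groups = {c: group_of(c, R, m=3) for c in medoids}
labels = assign(points, medoids)
```

Because the sorted index matrix is computed once, selecting a group in step 4 is a simple slice, which is what keeps each iteration cheap.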




3 P System for the Improved Ranked K-medoids Algorithm 


3.1 Construction of the Specific P System 


$$\Pi = (O, M_0, M_1, M_2, \ldots, M_k, ch, c_0, \rho)$$
The P system we construct includes membrane $M_0$ and membranes $M_t$ ($1 \le t \le k$). 
$w_{00}$ is the initial multiset contained in membrane $M_0$, and $a_1, a_2, \ldots, a_n$ represent 
the objects in a given data set. $R_0$ represents the rules in membrane $M_0$ (Fig. 1). 


3.2 Experiment and Results 


In order to better illustrate our P system model for the improved ranked 
K-medoids clustering, we take an example to simulate the procedure of the P system. 
There are 12 points (Fig. 2): 





Fig. 2. The initial state of the points 


The P system is supposed to distribute the points into 3 clusters with the given 
parameter m set to 3. Figure 2 depicts the original state of the 13 points. 

First, the distance between every two points is calculated, and the points (1, 1), (5, 7) 
and (8, 2) are found to satisfy the maximum distance method. After that, the rules in the 
three membranes are executed at the same time to find the final clustering medoids by 
hostility values. Finally, the objects are sent to the nearest clusters. 

Eventually, the clustering result is obtained: the 13 points are classified into 
3 clusters. The clustering result is shown in Fig. 3. 


Fig. 3. The final clustering result 
4 Conclusions 


Ranked K-medoids is a novel partitioning clustering algorithm that introduces a new 
term named the hostility value. The medoid is updated in each iteration until it finally 
finds the center of its cluster, and objects are then assigned to the nearest cluster. 

This paper presents a new ranked K-medoids algorithm based on the maximum 
distance method, by which we can find suitable initial medoids. In this way 
we do not need to relocate medoids that are assigned to the same cluster, and we profit 
from the great parallelism characteristic of P systems, so that the iteration time is shortened. 
Abstract. Supporting group Undo/Redo with high performance in a 3D 
design environment still remains a challenge. The question is how to recover 
the document state as if an operation had never been executed. In this paper, we 
propose an inverse-operation based group Undo/Redo solution for feature-based 
3D collaborative CAD systems. It allows a 3D part to be manipulated 
locally. We developed an ontology in the CAD domain so as to describe common 
elements in typical feature-based CAD systems. By classifying features according 
to how they affect a volume, feature modeling operations are categorized into four 
groups. We are able to create inverse operations for operations belonging to the 
same category. The proposed methods are tested in a prototype system with a case 
study. 


Keywords: Feature-based collaborative CAD · Group Undo/Redo · Inverse operation · Ontology 


1 Introduction 


Feature-based modeling approaches have played a relevant role in qualitative knowledge 
specification and integration in collaborative design since the 1970s [1-3], and 
feature-based CAD systems are currently considered the state-of-the-art technology for 
product modeling [4-7]. Undo/Redo is a standard and core function of nearly all 
human-computer interaction applications. It helps to remove erroneous operations and 
enables a user to explore a new system by trial and error. Most importantly, undo/redo 
is an essential tool to guarantee that a document is the result of the designers’ intentions by 
removing the effects of undesired operations. An Undo/Redo mechanism should understand 
which operation is to be undone when a user issues an undo command through different 
interaction methods. Besides, the undo effect demands that the result of a certain operation 
be eliminated as if it had never been executed. This raises the question of how 
the effect of an undo target can be removed from the 3D model. Our previous studies 
investigated both the full-rerun and the full-checkpoint strategies [8, 9]. Although the 
correctness of our group Undo/Redo algorithms can be guaranteed, the performance is still 
not satisfactory. 
Treating Undo as an inverse operation is a prevailing strategy for collaborative 
document and graphic editing. The inverse operation is executed on the current document 
directly. The document does not need to be rolled back and forth, which is considered 
more efficient than the two previously mentioned undo strategies. This mechanism 
depends largely on two conditions. One is that there should be quite a limited 
number of primitive operations. For example, in collaborative document editing, there 
are simply two primitive operations, insert and delete, which are regarded as the 
inverse operations of each other [10]. The other is that an object in the document 
should be easily located and distinguished from other objects. A feature-based CAD 
application has a large number of feature modeling operations. Generating an inverse 
operation for every feature modeling operation still remains a challenge. 
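
To make the text-editing case concrete, the sketch below shows insert and delete acting as inverses of each other on a plain string. It is a single-user illustration only: the `Op` type and helper names are hypothetical, and the operational transformation needed to undo an operation that is not the most recent one in a collaborative session is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str   # "insert" or "delete"
    pos: int    # character position in the document
    text: str   # text that is inserted or deleted


def apply(doc: str, op: Op) -> str:
    """Execute a primitive operation directly on the current document."""
    if op.kind == "insert":
        return doc[:op.pos] + op.text + doc[op.pos:]
    if op.kind == "delete":
        if doc[op.pos:op.pos + len(op.text)] != op.text:
            raise ValueError("delete does not match the document")
        return doc[:op.pos] + doc[op.pos + len(op.text):]
    raise ValueError(f"unknown operation kind: {op.kind}")


def inverse(op: Op) -> Op:
    """insert and delete are each other's inverse at the same position."""
    return Op("delete" if op.kind == "insert" else "insert", op.pos, op.text)


doc = "undo"
op = Op("insert", 4, "/redo")
edited = apply(doc, op)
restored = apply(edited, inverse(op))
```

With only two primitives, one `inverse` function covers every operation. A feature-based CAD system has no such small primitive set, which is the difficulty the paragraph above points out.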

In this paper, we propose an inverse-operation based group undo/redo 
approach to serve the purposes of both intention preservation and high performance. The 
basic idea is to generate inverse operations by building a domain ontology and classifying 
the modeling operations into quite a limited number of groups. An inverse operation 
is generated in a uniform way for an operation category rather than for any specific 
operation. The remainder of this paper is organized as follows. Section 2 reviews the 
related work from the perspective of current group undo methods and ontologies built 
in the CAD domain. In Sect. 3, the ontology constructed in the CAD domain is introduced, 
followed by a detailed description of how to create inverse operations for 
different feature groups. Section 4 presents the experiments that illustrate our group 
Undo/Redo approach based on the inverse-operation model. Conclusions and limitations 
are given in Sect. 5. 


2 Related Works 


Several undo models have been proposed for multi-user Undo/Redo. Each model sets a 
framework for developing specific undo and redo algorithms. The script model [11] is one of 
the earliest group Undo/Redo solutions. Regional undo was also applied in spreadsheets 
[12] by allowing users to select a part of the spreadsheet and perform undo on the cells. 
Treating an undo as a concurrent inverse operation is another popular solution for group 
undo. Berlage was the first to introduce the selective undo model, which adds the 
reverse operation of the selected command to the current context [13]. In collaborative 
text editors, delete and insert can be the inverse operations of each other. DistEdit 
[14] is the first OT-based selective undo solution. The undo target is transposed to the 
end of the history buffer and the inverse command of the transposed operation is 
executed. adOPTed [15, 16] also considers an inverse as an operation concurrent with 
the operations generated after the undone operation O. Neither DistEdit nor adOPTed 
supports full-featured selective undo. Sun proposed the AnyUndo framework to support 
“Anytime, Anywhere undo”. The proposed algorithms GOTO and COT both integrate do and 
selective undo [17, 18]. Azurite [19, 20] provides general selective undo in a code editor. 
Azurite uses the inverse model, adding the inverse text editing operation to the end 
of the history, and provides a variety of user interfaces to resolve conflicts. Logoot-Undo 
[21] is a group undo algorithm supporting AnyUndo under the CRDT framework. 
The document is manipulated at the line level. The effect of undoing an inserted line 
is actually achieved by changing the visibility of the line. Conflicts between 
two concurrent undos are therefore avoided, because an undo that cannot find a line by 
its unique identifier can simply be discarded. 

Applying the inverse model to graphical editors seems easier than to painting 
systems. On the one hand, in an object-oriented system, the inverse operation can be easily 
generated by retrieving the design history and then executing a system-defined 
operation using the properties the graphical object had before the latest modification 
[13, 22]. On the other hand, in pixel-based drawing systems, such as Photoshop, the 
historical information of every pixel needs to be recorded, which is resource consuming. 
In Wang’s solution for undo in bitmap systems [23], only the pixels of the drawing area 
and the connections between operations are recorded. Aquamarine [24, 25] is a system 
developed on the basis of Photoshop for selective undo using the script model. Another undo 
scheme developed under the script model is the group Undo/Redo method for 3D collaborative 
CAD systems from our previous research [25]. 


3 Ontology-Based Inverse Operation Generation 


We build an ontology, called the Part Feature Ontology (PFO), to describe the semantics of 
a part from the perspectives of individual settings and parameter values. We also intend 
to use this ontology to categorize features in a 3D collaborative CAD system. The classes 
defined in PFO are: modeling feature, topology info, B-rep Operation, Reference Info 
and Sketch Info. Six relations are defined: is-a, has-topology-attribute, 
has-reference-attribute, has-boolean-operation, has-dependency-attribute, and has-sketch. 
We categorize features according to their effects on the volume. The subclasses of 
modeling feature include convex feature, concave feature, edge transition feature and 
primitive feature. 


3.1 Inverse Operation for Convex Features 


The inverse operation of a convex feature creating operation is to remove any topological 
entities that protrude outward from the datum plane, regardless of the shape of the 
feature. The construction process of the inverse operation is presented in Algorithm 1. 
Algorithm 1 The inverse operation for convex feature creating operations 

Function Inverse(O) 
1: Obtain the topological entities that belong to the feature created by O. Typically, these 
topological entities protrude above the original part model. This is accomplished by 
parsing the semantic descriptor OWL file FD_O with the OWL API parser so as to obtain the 
attribute TopologicalFaceSet; 
2: For every topological face f in the TopologicalFaceSet, execute delete(f) so as to 
remove the revealed faces from the part model; 
3: Detect the broken area on the datum plane, create a face of the same shape, and merge 
the face with the plane; 
4: return; 


Figure 1 gives an example. Three feature instances are added to the same base 
block feature by different modeling methods. O1 adds the convex feature by sweeping 
the face f along a curve. O2 adds the convex feature by rotating. O3 adds the feature by 
extruding. The features are in different forms. It is worth noting that, in some situations, 
deleting revealed faces from the boundary leaves a broken area on the 
datum plane. In this case, the plane should be repaired. As illustrated in Fig. 2, when 
the revealed faces of the round protrusion are deleted, a circular broken 
area is left on the datum plane where the protrusion was located. 

Fig. 1. Inverse operations for different convex feature instances 

Fig. 2. An example of the broken area 


3.2 Inverse Operation for Concave Feature Creating Operations 


The inverse operation of a concave feature creating operation is to refill any hollow 
space it creates in the part model. Inspired by the definition of the bounding box widely 
used in 3D modeling, we intend to find the Minimum Wrapping Solid, denoted MWS 
for brevity, for a hollow space of any form. An MWS is the solid with the smallest 
volume that contains all the points of the depression feature. Then, by uniting the 
MWS with the part model, the hollow space is refilled. The formal definition of the 
Minimum Wrapping Solid is given in the following. 

In Fig. 3, we give several examples of the MWSs for depression features of different 
types and forms. O1 creates a cubic slot. O2 creates a round hole. O3 creates 
a sink hole. For the sake of creating an MWS, geometry-related information, such as the 
radius and length of a simple hole or the width, length and height of a block, is more 
critical than topology-related information. 


[Figure 3 panels: base feature; creating minimum wrapping solids.] 
Fig. 3. Inverse operations for different concave feature instances 

[Figure 4 panels: the undo process for the concave edge and for the convex edge, each shown before undo and after undo.] 
Fig. 4. Inverse operations for different edge transition feature instances 

Algorithm 2 The inverse operation for concave feature creating operations 

Function Inverse(O) 
1: Parse FD_O and extract the information, namely the category the feature instance 
belongs to, the dimension parameters and the reference information; 
2: Construct the MWS for the concave feature and transform it to the proper 
position by using the reference information extracted in step 1; 
3: Unite the MWS with the current part model so as to refill the hollow space that O 
created; 
4: Delete the faces of the MWS that protrude outward from the part and repair the broken 
area; 
5: return; 


3.3 Inverse Operation for Edge Transition Feature Creating Operations 


We devise the inverse operations by considering how an edge transition is created. As 
illustrated in Fig. 4, an edge transition is created in two different situations: (1) on 
the left side of the figure, a concave edge is chosen for the edge transition, which appears to 
subtract a small volume from the part; (2) on the right side of the figure, a convex edge 
is chosen and the result appears to add a volume so that the two orthogonal faces connect 
smoothly. The procedure for creating inverse operations for the different edge transition 
cases is given in Algorithm 3. 


Algorithm 3 The inverse operation for edge transition feature creating operations 

Function Inverse(O) 
1: Parse FD_O and extract the information, namely the category the feature instance 
belongs to and the dimension parameters; 
2: Construct the MWS for the transition edge and transform it to the proper 
position; 
3: If O selects a convex edge for the edge transition, unite the MWS with the part; if O 
selects a concave edge for the edge transition, subtract the MWS from the part; 
4: return; 
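
The Boolean steps of Algorithms 2 and 3 can be illustrated with a toy model in which a solid is a set of unit voxels. This is a sketch under strong assumptions, not a B-rep implementation: the voxel representation, the `box` helper and the function names are hypothetical, the MWS is assumed to fit the affected volume exactly (so the face deletion and repair of Algorithm 2, step 4 is not needed), and the unite/subtract rule follows Algorithm 3 as written.

```python
from itertools import product


def box(x0, x1, y0, y1, z0, z1):
    """Axis-aligned block of unit voxels, used here as a stand-in for a solid."""
    return frozenset(product(range(x0, x1), range(y0, y1), range(z0, z1)))


def inverse_concave(part, mws):
    """Algorithm 2, step 3: unite the MWS with the part to refill the hollow space."""
    return part | mws


def inverse_edge_transition(part, mws, edge_kind):
    """Algorithm 3, step 3: unite for a convex edge, subtract for a concave edge."""
    if edge_kind == "convex":
        return part | mws
    if edge_kind == "concave":
        return part - mws
    raise ValueError(f"unknown edge kind: {edge_kind}")


base = box(0, 4, 0, 4, 0, 2)

slot = box(1, 3, 1, 3, 1, 2)             # hollow space left by a concave feature
refilled = inverse_concave(base - slot, slot)

cut = box(3, 4, 0, 4, 1, 2)              # volume removed along a convex edge
restored_convex = inverse_edge_transition(base - cut, cut, "convex")

added = box(4, 5, 0, 4, 0, 1)            # volume added along a concave edge
restored_concave = inverse_edge_transition(base | added, added, "concave")
```

In all three cases the inverse only needs the feature category and the MWS, not the specific modeling command that created the feature, which is the point of generating inverses per category.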


4 Experiments 


In this example, three geographically dispersed sites build and modify models with both 
do and undo operations. How the boundary of the part evolves with the continuous 
execution of both do and undo at each site is shown in Fig. 5, and the expected effect of 
every operation is described in detail in Table 1. Whenever an operation is performed, 
the related hybrid feature descriptor is generated. As can be observed from Fig. 5, the 
collaboration activity involves three stages. The operations issued in the first stage are 
pure modeling operations. The second stage contains concurrent do and undo. The third 
stage is a typical scenario of concurrent undo. The emphasis is laid on the group undo process. 


Fig. 5. A typical scenario of the collaborative modeling process 


When undo1 is issued by site1, its intention is to delete the round hole feature created 
by O5 from site1. It is carried out immediately at site1. The undo command is sent to 
the other sites in the form of UNDO((1, 1), (1, 3)). Since the round hole is an instance of a 
concave feature, the MWS for the round hole is created given the length of the hole and 
the radius of the circular end face. The MWS is then transformed given the locating 
information contained in the reference information descriptor of FD_O5. By uniting it with 
the part, O5 is successfully undone. When undo1 is received by a remote site, O5 is 
identified with the assistance of the tuple (1, 3). The undo is then carried out in the same 
way as at the local site. 

Undo2 and undo3 are concurrently generated at site0 and site2. However, they 
do not conflict. The intention of undo2 is to delete the round protrusion created 
by O3 from site1. It is carried out immediately at site0. The undo command is sent to 
the other sites in the form of UNDO((0, 1), (1, 1)). The revealed faces of the round 
protrusion are deleted. Then, the broken area left on the base feature is repaired. The 
required topology information is obtained from FD_O3. When undo2 is received by a remote 
site, O3 is picked out with the assistance of the tuple (1, 1). 

Table 1. Operation descriptions 

Feature modeling operation | Operation effect 
O1 | Create the base cylinder feature 
O2 | Create another cylinder with the same radius on top of the base feature 
O3 | Create a round protrusion across the base cylinder 
O4 | Create a round hole within the base cylinder 
O5 | Create a round hole within the round protrusion built by O3 
O6 | Create a round cylinder crossing the hole generated by O3. The end of the cylinder is decorated by a spiral line. 
O7 | A mirror operation is invoked to generate another three holes around the round protrusion generated by O3. The four holes are now symmetric about the center of the round protrusion, so O7 depends on O4. 


The intention of undo3 is to delete the small round hole within the base cylinder 
created by O4 from site1. It is carried out immediately at site2 and sent to the other sites in 
the form of UNDO((2, 1), (2, 1)). The MWS for this round hole is created given its 
length and radius. It is then transformed given the reference information 
acquired from FD_O4. By uniting it with the part, O4 is successfully undone. O7 is undone 
as well because of its dependency relation with O4. 
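
The way a remote site uses the second tuple of an UNDO message to locate its target can be sketched as a lookup in the local history. This is a hypothetical illustration: the text does not define the message format, so the sketch assumes that the first tuple identifies the undo command and the second identifies the target operation, both as (site, sequence number) pairs. The `Site` class and the mapping of tuples to O3 and O5 are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    site_id: int
    history: dict = field(default_factory=dict)   # (site, seq) -> operation name
    undone: dict = field(default_factory=dict)    # target id -> id of the undo command

    def record(self, op_id, name):
        """Store an executed operation under its (site, seq) identifier."""
        self.history[op_id] = name

    def receive_undo(self, undo_id, target_id):
        """Handle UNDO(undo_id, target_id): locate the target and mark it undone."""
        if target_id not in self.history:
            raise KeyError(f"unknown undo target {target_id}")
        if target_id in self.undone:
            return None                            # already undone, nothing to do
        self.undone[target_id] = undo_id
        return self.history[target_id]


remote = Site(site_id=0)
remote.record((1, 1), "O3")    # assumed: first operation issued by site1
remote.record((1, 3), "O5")    # assumed: third operation issued by site1
target = remote.receive_undo((1, 1), (1, 3))       # UNDO((1, 1), (1, 3))
```

Once the target is identified, the site would build the inverse operation from the stored feature descriptor, as described in Sect. 3.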


5 Conclusions 


In this paper, we propose a novel inverse-operation based group Undo/Redo algorithm 
for feature-based 3D collaborative CAD systems. This work is the first to apply the 
inverse model to the 3D domain. We built an ontology in the CAD 
domain and developed classes with sub-classes to describe general entities in the 
feature-based 3D CAD environment. Using these concepts and their relations, we generate a 
hybrid descriptor to record the information of a feature. For a convex feature, we delete 
the topological entities it creates and repair the broken area left after the deletion. For a 
concave feature of any shape, we refill the hollow space by creating the minimum 
wrapping solid and uniting it with the part. This group Undo/Redo solution makes it 
possible to modify a part locally. Our future work is to deal with the attribute 
dependencies and attribute conflicts among features. 
Abstract. Phase noise caused by the local oscillator affects both the 
cancelation of self-interference and the reception of the desired signal. In 
this paper, a novel two-step method exploiting polarization is proposed 
to resist the effect of phase noise on self-interference cancelation in 
full-duplex systems. The method converts the multiplicative noise to additive 
noise with polarization signal processing in two steps, which cancels 
the effect of phase noise on both the self-interference and the desired signal. 
It should be noted that this method needs no prior information about the 
phase noise. Theoretical analysis and numerical simulations show that the 
effect of phase noise is reduced, and the amount of cancelation is 
evaluated for different phase noise values. 


Keywords: Full duplex · Self-interference cancelation · Phase noise · Polarization 


1 Introduction 


Full-duplex is a mode of communication that allows a node to transmit and 
receive signals simultaneously in the same frequency band. It has higher 
spectral efficiency, higher throughput, and lower transmission delay than the 
traditional half-duplex system, which satisfies the requirements of 5G very well [1-5]. 
However, massive self-interference (SI) caused by the local transmitter coupling 
into the local receiver is an unavoidable problem. Self-interference 
transmitted through a short path is 15-100 dB stronger than the desired signal [6]. 
Therefore the cancelation of self-interference is one of the key factors influencing 
full-duplex communication. Moreover, it has been demonstrated that 
phase noise is one of the bottlenecks of self-interference cancelation [7]. 
Many self-interference cancelation methods at the receiver have been put forward, 
which can be classified into three categories: antenna 
cancelation [2,5,9], radio frequency cancelation (RFC) [1,3,6,8], and digital 
cancelation [10-13]. Antenna cancelation and radio frequency cancelation ignore 
the phase noise, and can suppress the self-interference 
to 10-15 dB stronger than the desired signal. Digital cancelation utilizes a pilot 
signal to track the phase noise and uses a compensation structure to improve the 
amount of self-interference cancelation. The effect of phase noise caused by 
separate local oscillators or by the same local oscillator on self-interference cancelation 
is analyzed in [10]. Inter-carrier interference (ICI) and common phase error (CPE) 
caused by the phase noise are handled by estimating and compensating the 
phase noise in [11,12], which improves the amount of cancelation by 7-10 dB. A 
closed-form relation between the amount of cancelation and the 3 dB bandwidth of 
the phase noise is derived in [13]. However, the existing methods dealing with phase 
noise focus only on the self-interference signal and ignore the effect of phase noise 
on the desired signal. In this paper, a novel polarization-based self-interference 
cancelation method is proposed to cancel the effect of phase noise on both the 
self-interference signal and the desired signal without any prior information about 
the phase noise. 

The polarization state (PS) of the signal is a substantive characteristic that can 
be used to carry information, like the time, frequency, space, and code characteristics 
[14]. Polarization modulation technology has been proposed, and oblique 
projection filtering is used to distinguish the desired signal from the noise [15]. 
The PS of the signal is determined by the amplitude ratio and the phase difference 
of its two branches. Taking advantage of the PS definition, the basic idea of this 
paper is that phase noise does not affect the phase difference between the two 
branches but only affects the absolute phase; thus the phase noise has no effect on 
the PS of the signal. We proved the theoretical feasibility in previous work [16], and 
in this paper we further consider the effect of phase noise on both the 
self-interference and the desired signal. 
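
The invariance argument can be checked numerically with a few lines of Python. The sketch below assumes the two-branch (Jones vector) form used later in Sect. 2.2 and models phase noise as a common factor $e^{j\phi}$ applied to both branches; the function names and the numerical values are illustrative only.

```python
import cmath
import math


def jones(eps, delta):
    """Jones vector [cos(eps), sin(eps) * e^{j*delta}] of a polarization state."""
    return (complex(math.cos(eps)), math.sin(eps) * cmath.exp(1j * delta))


def polarization_state(h, v):
    """Recover (amplitude ratio |v|/|h|, phase difference arg(v) - arg(h))."""
    return abs(v) / abs(h), cmath.phase(v * h.conjugate())


h, v = jones(math.radians(30), math.radians(60))
phase_noise = cmath.exp(1j * 0.7)        # common multiplicative phase term

ratio_clean, diff_clean = polarization_state(h, v)
ratio_noisy, diff_noisy = polarization_state(h * phase_noise, v * phase_noise)
```

Both the amplitude ratio and the phase difference are the same with and without the common phase term, because the term cancels in the product of one branch with the conjugate of the other.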

In this paper, the PS of the signal is used to cancel the effect of the 
phase noise on the self-interference and desired signals, since it is not affected 
by the phase noise. The algorithm is divided into two steps that convert 
the multiplicative noise to additive noise with polarization signal processing. 
The first step converts the effect of phase noise on the self-interference 
signal to the desired signal and white noise, and uses the reconstructed signal 
fed back from the transmitter to cancel the self-interference signal. The second step 
converts the effect of phase noise on the desired signal to white noise, and then 
recovers the desired signal. Neither step needs prior 
information about the phase noise. This method uses polarization matching to 
receive the desired signal and improve the amount of self-interference cancelation. 
Theoretical analysis and numerical simulation show the effect of phase noise on 
self-interference cancelation. The amount of cancelation can be improved by 0-10 dB 
for different phase noise values. This paper mainly analyzes the 
effect of phase noise on self-interference cancelation; other factors, such as 
the nonlinearity of the power amplifiers of the transmitter and receiver, I/Q (in-phase and 
quadrature components) imbalance, and the quantization noise of the ADC 
(analog-to-digital converter), are ignored. 

The rest of the paper is organized as follows. The full-duplex communication 
system model and the polarization signal model are presented in Sect. 2. In Sect. 3, 
the two-step self-interference cancelation against phase noise based on 
polarization signal processing is proposed. Simulation results are presented in Sect. 4 
to demonstrate the effectiveness of the proposed scheme. Finally, conclusions are 
drawn in Sect. 5. 











2 System and Signal Model 


2.1 System Model 


A full-duplex communication link is established between two nodes as shown in 
Fig. 1. A significant difference between the polarization full-duplex model and the 
traditional full-duplex communication model is the introduction of a Polarization 
Control Module (PCM) at the transmitter. After the Code Modulation Module (CMM), 
the signal enters the PCM. The PCM is composed of a Power Division Unit (PDU) 
and a Phase Shifting Unit (PSU); the amplitude ratio is controlled by the PDU and 
the phase difference is controlled by the PSU. The signal obtains a specific PS 
after the PCM and then enters the D/A conversion module. The up-converted signal, 
which is affected by phase noise, enters the orthogonal dual-polarized antenna. 
An orthogonal dual-polarized antenna is also adopted at the receiver, and the mixed 
signal is again affected by phase noise when down-converted at the receiver. The effect of 
phase noise is analyzed in the following signal model. After the self-interference 
cancelation, the mixed signal enters the Demodulation Module (DM). 

Fig. 1. Full-duplex communication system model of polarization 


2.2 Signal Model 


The Jones vector is used to express the polarization state of the signal. The polarization 
states of the desired signal and the self-interference are $P_s \in \mathbb{C}^{2\times 1}$ and 
$P_i \in \mathbb{C}^{2\times 1}$ respectively, and the desired signal $S_t$ and self-interference $I_t$ are 

$$S_t = P_s s(t) = [\cos(\varepsilon_s),\ \sin(\varepsilon_s)e^{j\delta_s}]^{T} s(t), \tag{1}$$

$$I_t = P_i i(t) = [\cos(\varepsilon_i),\ \sin(\varepsilon_i)e^{j\delta_i}]^{T} i(t), \tag{2}$$

where $\varepsilon_s$ and $\varepsilon_i$ express the polarization angles, and $\delta_s$ and $\delta_i$ express the phase angles 
of the desired and self-interference signals respectively. $s(t)$ and $i(t)$ express the 
desired signal and the self-interference in the time domain: 








$$s(t) = a(t)e^{j\omega_c t}e^{j\phi_{st}(t)}, \tag{3}$$

$$i(t) = b(t)e^{j\omega_c t}e^{j\phi_{it}(t)}, \tag{4}$$

where $a(t)$ and $b(t)$ represent the amplitudes of the desired and self-interference 
signals respectively, and $\omega_c$ is the carrier frequency. $\phi_{st}(t)$ and $\phi_{it}(t)$ represent 
the respective phase noise. The desired and self-interference signals in the 
polarization domain are then 


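$$S_t = \begin{bmatrix} a(t)\cos(\varepsilon_s)e^{j(\omega_c t+\phi_{st}(t))} \\ a(t)\sin(\varepsilon_s)e^{j(\omega_c t+\phi_{st}(t)+\delta_s)} \end{bmatrix} = \begin{bmatrix} E_{hs}e^{j\omega_c t} \\ E_{vs}e^{j(\omega_c t+\delta_s)} \end{bmatrix} e^{j\phi_{st}(t)}, \tag{5}$$

$$I_t = \begin{bmatrix} b(t)\cos(\varepsilon_i)e^{j(\omega_c t+\phi_{it}(t))} \\ b(t)\sin(\varepsilon_i)e^{j(\omega_c t+\phi_{it}(t)+\delta_i)} \end{bmatrix} = \begin{bmatrix} E_{hi}e^{j\omega_c t} \\ E_{vi}e^{j(\omega_c t+\delta_i)} \end{bmatrix} e^{j\phi_{it}(t)}, \tag{6}$$

where $E_{hs} = a(t)\cos(\varepsilon_s)$ and $E_{vs} = a(t)\sin(\varepsilon_s)$ represent the H and V branches of 
the desired signal, and $E_{hi} = b(t)\cos(\varepsilon_i)$ and $E_{vi} = b(t)\sin(\varepsilon_i)$ represent the H and 
V branches of the self-interference. Suppose the channel is Gaussian 
and the noise is additive white Gaussian noise $N(t)$; the mixed signal at the 
receiver is 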





$$y(t) = S_r(t) + I_r(t) + N(t), \tag{7}$$

where $S_r(t)$ and $I_r(t)$ represent the desired and the interference signals at the 
receiver. After down-conversion, the equivalent signal at baseband can be 
written as 





yL(t) = S(t) 4 I(t)? ®© +Nz(t), (8) 


where $\phi_r(t)$ represents the phase noise caused by the local oscillator at the receiver and
$N_L(t)$ expresses the white noise at baseband. Then Eq. (8) can be written as

$$y_L(t) = \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} e^{j(\phi_{st}(t)+\phi_r(t))} + \begin{bmatrix} E_{hi} \\ E_{vi}e^{j\delta_i} \end{bmatrix} e^{j(\phi_{it}(t)+\phi_r(t))} + \begin{bmatrix} N_{hL}(t) \\ N_{vL}(t) \end{bmatrix}, \tag{9}$$

where $N_{hL}(t)$ and $N_{vL}(t)$ express the H and V branches of $N_L(t)$. The two
branches are Gaussian random variables; $N_{hL}(t)$ and $N_{vL}(t)$
are independent and identically distributed with mean 0 and variance $\sigma^2/2$.
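
The signal model above can be illustrated with a short numerical sketch. It builds the Jones vectors of Eqs. (1) and (2) and one (H, V) sample of the baseband mixed signal in the form of Eq. (9). The function names, the numeric angles and amplitudes, and the `noise_std` parameter (standard deviation of each real noise component) are hypothetical choices made for illustration and do not come from the paper.

```python
import cmath
import math
import random


def jones_vector(eps, delta):
    """Jones vector [cos(eps), sin(eps)*e^{j*delta}] as in Eqs. (1)-(2)."""
    return (complex(math.cos(eps), 0.0), math.sin(eps) * cmath.exp(1j * delta))


def baseband_sample(a, b, p_s, p_i, phi_st, phi_it, phi_r, noise_std, rng):
    """One (H, V) sample of the baseband mixed signal in the form of Eq. (9).

    noise_std is a hypothetical parameter: the standard deviation of each
    real (in-phase or quadrature) noise component.
    """
    branches = []
    for ps_k, pi_k in zip(p_s, p_i):
        # Desired signal carries transmitter and receiver phase noise.
        desired = a * ps_k * cmath.exp(1j * (phi_st + phi_r))
        # Self-interference carries its own transmitter phase noise.
        interference = b * pi_k * cmath.exp(1j * (phi_it + phi_r))
        noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        branches.append(desired + interference + noise)
    return tuple(branches)


rng = random.Random(0)
p_s = jones_vector(math.pi / 4, math.pi / 4)   # illustrative angles only
p_i = jones_vector(math.pi / 6, 0.0)
y_h, y_v = baseband_sample(1.0, 100.0, p_s, p_i, 0.05, -0.02, 0.01, 0.1, rng)
```

Both Jones vectors have unit norm by construction, so the amplitudes a and b alone set the power of the two signals.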


3 Two Steps Method of Self-interference Cancelation 


The signal enters the orthogonal dual-polarization antenna through the
AWGN channel. The mixer then down-converts it to
baseband, where it is affected by phase noise. The traditional algorithms rely on
estimation and compensation [11-13], so the estimation error of the phase noise
affects the cancelation of self-interference. The proposed elimination algorithm
in the polarization domain is shown in Fig. 2. This method converts the
multiplicative noise to additive noise with polarization signal
processing in two steps. The first step converts the effect of phase noise on the
self-interference to the desired signal and the white noise using Stokes vector
processing in the polarization domain, and then uses the reconstructed signal to cancel the
self-interference signal. The second step recovers the desired signal using the
same principle. The mixed signal then enters the Match Receiving module
(MR) and finally the Demodulation Module (DM).

Fig. 2. The self-interference elimination algorithm


3.1 Eliminating the Effect of Phase Noise on Self-interference 


After down-conversion, the mixed signal is expressed as
Eq. (9), in which $[N_{hL}(t)\ \ N_{vL}(t)]^{T}$ can be written as Eq. (10).

$$\begin{bmatrix} N_{hL}(t) \\ N_{vL}(t) \end{bmatrix} = \begin{bmatrix} n_{hc} + jn_{hs} \\ n_{vc} + jn_{vs} \end{bmatrix}. \tag{10}$$

Here $n_{hc}$ and $n_{hs}$ express the in-phase and quadrature components of $N_{hL}(t)$,
and $n_{vc}$ and $n_{vs}$ are defined similarly for $N_{vL}(t)$. For convenience of analysis, the
signal is transformed into Stokes components, as in expression (11).


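$$\begin{aligned}
S_1 &= |y_{LH}(t)|^2 - |y_{LV}(t)|^2, \\
S_2 &= 2|y_{LH}(t)||y_{LV}(t)|\cos(\varphi_{LH}(t) - \varphi_{LV}(t)), \\
S_3 &= 2|y_{LH}(t)||y_{LV}(t)|\sin(\varphi_{LH}(t) - \varphi_{LV}(t)),
\end{aligned} \tag{11}$$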

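where $S_1$, $S_2$, $S_3$ express the three Stokes components, $y_{LH}(t)$ and $y_{LV}(t)$
express the H and V branches of $y_L(t)$, and $\varphi_{LH}(t)$ and $\varphi_{LV}(t)$ are their phases. Through (9) and (11), the Stokes components of $y_L(t)$ are expressed as

$$\begin{aligned}
S_1 ={}& (E_{hi}^2 - E_{vi}^2) + (E_{hs}^2 - E_{vs}^2) + (n_{hc}^2 + n_{hs}^2) - (n_{vc}^2 + n_{vs}^2) \\
&+ 2\left(E_{hi}E_{hs}\cos(\phi_{st}(t) - \phi_{it}(t)) - E_{vi}E_{vs}\cos(\phi_{st}(t) - \phi_{it}(t) + \delta_s - \delta_i)\right) \\
&+ 2\begin{bmatrix} E_{hs}\cos(\phi_{st}(t) - \phi_{it}(t)) + E_{hi} \\ E_{hs}\sin(\phi_{st}(t) - \phi_{it}(t)) \end{bmatrix}^{T} \begin{bmatrix} n_{hc1} \\ n_{hs1} \end{bmatrix} \\
&- 2\begin{bmatrix} E_{vs}\cos(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{vi}\cos(\delta_i) \\ E_{vs}\sin(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{vi}\sin(\delta_i) \end{bmatrix}^{T} \begin{bmatrix} n_{vc1} \\ n_{vs1} \end{bmatrix},
\end{aligned} \tag{12}$$

$$\begin{aligned}
S_2 ={}& 2\Big( E_{hs}E_{vs}\cos(\delta_s) + E_{hs}E_{vi}\cos(\phi_{st}(t) - \phi_{it}(t) - \delta_i) \\
&+ E_{hi}E_{vs}\cos(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{hi}E_{vi}\cos(\delta_i) + n_{hc}n_{vc} + n_{hs}n_{vs} \\
&+ \begin{bmatrix} E_{vs}\cos(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{vi}\cos(\delta_i) \\ E_{vs}\sin(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{vi}\sin(\delta_i) \end{bmatrix}^{T} \begin{bmatrix} n_{hc1} \\ n_{hs1} \end{bmatrix} \\
&+ \begin{bmatrix} E_{hs}\cos(\phi_{st}(t) - \phi_{it}(t)) + E_{hi} \\ E_{hs}\sin(\phi_{st}(t) - \phi_{it}(t)) \end{bmatrix}^{T} \begin{bmatrix} n_{vc1} \\ n_{vs1} \end{bmatrix} \Big),
\end{aligned} \tag{13}$$

$$\begin{aligned}
S_3 ={}& 2\Big( E_{hs}E_{vs}\sin(-\delta_s) + E_{hs}E_{vi}\sin(\phi_{st}(t) - \phi_{it}(t) - \delta_i) \\
&- E_{hi}E_{vs}\sin(\phi_{st}(t) - \phi_{it}(t) + \delta_s) - E_{hi}E_{vi}\sin(\delta_i) + n_{hs}n_{vc} - n_{hc}n_{vs} \\
&+ \begin{bmatrix} -E_{vs}\sin(\phi_{st}(t) - \phi_{it}(t) + \delta_s) - E_{vi}\sin(\delta_i) \\ E_{vs}\cos(\phi_{st}(t) - \phi_{it}(t) + \delta_s) + E_{vi}\cos(\delta_i) \end{bmatrix}^{T} \begin{bmatrix} n_{hc1} \\ n_{hs1} \end{bmatrix} \\
&+ \begin{bmatrix} E_{hs}\sin(\phi_{st}(t) - \phi_{it}(t)) \\ -E_{hs}\cos(\phi_{st}(t) - \phi_{it}(t)) - E_{hi} \end{bmatrix}^{T} \begin{bmatrix} n_{vc1} \\ n_{vs1} \end{bmatrix} \Big),
\end{aligned} \tag{14}$$

while

$$\begin{bmatrix} n_{hc1} \\ n_{hs1} \end{bmatrix} = \begin{bmatrix} \cos(\phi_{it}(t)+\phi_r(t)) & \sin(\phi_{it}(t)+\phi_r(t)) \\ -\sin(\phi_{it}(t)+\phi_r(t)) & \cos(\phi_{it}(t)+\phi_r(t)) \end{bmatrix} \begin{bmatrix} n_{hc} \\ n_{hs} \end{bmatrix}, \tag{15}$$

$$\begin{bmatrix} n_{vc1} \\ n_{vs1} \end{bmatrix} = \begin{bmatrix} \cos(\phi_{it}(t)+\phi_r(t)) & \sin(\phi_{it}(t)+\phi_r(t)) \\ -\sin(\phi_{it}(t)+\phi_r(t)) & \cos(\phi_{it}(t)+\phi_r(t)) \end{bmatrix} \begin{bmatrix} n_{vc} \\ n_{vs} \end{bmatrix}. \tag{16}$$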


$[n_{hc1}\ \ n_{hs1}]^{T}$ and $[n_{vc1}\ \ n_{vs1}]^{T}$ express the H and V branches of the white noise
after the first rotation. According to the rotation characteristics of unitary matrices
[14], $[n_{hc1}\ \ n_{hs1}]^{T}$ and $[n_{vc1}\ \ n_{vs1}]^{T}$ have the same distributions as $[n_{hc}\ \ n_{hs}]^{T}$
and $[n_{vc}\ \ n_{vs}]^{T}$, so $y_L(t)$ can be calculated as

$$y_L(t) = \begin{bmatrix} E_{hi} \\ E_{vi}e^{j\delta_i} \end{bmatrix} + \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} e^{j(\phi_{st}(t)-\phi_{it}(t))} + \begin{bmatrix} N_{hL1}(t) \\ N_{vL1}(t) \end{bmatrix}, \tag{17}$$


where $N_{hL1}(t) = n_{hc1} + jn_{hs1}$ and $N_{vL1}(t) = n_{vc1} + jn_{vs1}$. Comparing Eqs. (9) and
(17), the phase noise of the self-interference is converted to the desired signal and
the white noise, and the distribution of the white noise has not changed. The first
rotation does not simply multiply both sides of Eq. (9) by $e^{-j(\phi_{it}(t)+\phi_r(t))}$,
because $\phi_{it}(t)$, $\phi_{st}(t)$, and $\phi_r(t)$ are unknown. It only uses the characteristics of phase
noise in the frequency domain. Therefore it is unnecessary to estimate the phase noise
in this paper, which is an advantage over the methods in the time and
frequency domains. Using the canceling signal $y_{c1}(t)$ fed back from the transmitter through
a wired link, the residual signal is given by Eq. (18).





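$$\begin{aligned}
y_{LR}(t) &= y_L(t) - y_{c1}(t) \\
&= \begin{bmatrix} E_{hi} \\ E_{vi}e^{j\delta_i} \end{bmatrix} + \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} e^{j(\phi_{st}(t)-\phi_{it}(t))} + \begin{bmatrix} N_{hL1}(t) \\ N_{vL1}(t) \end{bmatrix} - \begin{bmatrix} E_{hi} \\ E_{vi}e^{j\delta_i} \end{bmatrix} \\
&= \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} e^{j(\phi_{st}(t)-\phi_{it}(t))} + \begin{bmatrix} N_{hL1}(t) \\ N_{vL1}(t) \end{bmatrix},
\end{aligned} \tag{18}$$

where $y_{LR}(t)$ represents the residual signal after the first cancelation step. From
Eq. (18), the self-interference is cancelled. However, its phase noise is converted
to the desired signal, which is addressed in Sect. 3.2.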
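
The reason the unknown phases never need to be estimated can be checked numerically: the Stokes components of Eq. (11) do not change when both branches are multiplied by the same unknown phase factor, and the rotation of Eqs. (15) and (16) preserves the length of the noise vector. The sketch below demonstrates both properties; the function names and sample values are hypothetical and serve only as an illustration.

```python
import cmath
import math


def stokes(y_h, y_v):
    """Stokes components (S1, S2, S3) of Eq. (11) for one (H, V) sample."""
    s1 = abs(y_h) ** 2 - abs(y_v) ** 2
    # y_h * conj(y_v) = |y_h||y_v| e^{j(phase_h - phase_v)}
    cross = y_h * y_v.conjugate()
    return (s1, 2.0 * cross.real, 2.0 * cross.imag)


def rotate(n_c, n_s, theta):
    """Rotation of Eqs. (15)-(16) applied to (in-phase, quadrature) noise."""
    return (math.cos(theta) * n_c + math.sin(theta) * n_s,
            -math.sin(theta) * n_c + math.cos(theta) * n_s)


# Arbitrary sample values; `common` plays the role of the unknown factor
# e^{j(phi_it + phi_r)} shared by both branches.
y_h, y_v = 0.8 + 0.3j, -0.2 + 0.5j
common = cmath.exp(1j * 1.234)
before = stokes(y_h, y_v)
after = stokes(y_h * common, y_v * common)
rotated = rotate(0.3, -0.4, 0.7)
```

Here `before` and `after` agree up to rounding, and `rotated` has the same Euclidean length as the input pair (0.3, -0.4).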





3.2 Eliminating the Effect of Phase Noise on Desired Signal 


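Suppose:

$$\begin{bmatrix} n_{hc2} \\ n_{hs2} \end{bmatrix} = \begin{bmatrix} \cos(\phi_{st}(t)-\phi_{it}(t)) & \sin(\phi_{st}(t)-\phi_{it}(t)) \\ -\sin(\phi_{st}(t)-\phi_{it}(t)) & \cos(\phi_{st}(t)-\phi_{it}(t)) \end{bmatrix} \begin{bmatrix} n_{hc1} \\ n_{hs1} \end{bmatrix}, \tag{19}$$

$$\begin{bmatrix} n_{vc2} \\ n_{vs2} \end{bmatrix} = \begin{bmatrix} \cos(\phi_{st}(t)-\phi_{it}(t)) & \sin(\phi_{st}(t)-\phi_{it}(t)) \\ -\sin(\phi_{st}(t)-\phi_{it}(t)) & \cos(\phi_{st}(t)-\phi_{it}(t)) \end{bmatrix} \begin{bmatrix} n_{vc1} \\ n_{vs1} \end{bmatrix}. \tag{20}$$

$[n_{hc2}\ \ n_{hs2}]^{T}$ and $[n_{vc2}\ \ n_{vs2}]^{T}$ express the H and V branches of the white noise
after the second rotation. Using the same method as in the first step, Eq. (18)
can be written as Eq. (21).

$$y_{LR}(t) = \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} + \begin{bmatrix} N_{hL2}(t) \\ N_{vL2}(t) \end{bmatrix}. \tag{21}$$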


In Eq. (21), $N_{hL2}(t) = n_{hc2} + jn_{hs2}$ and $N_{vL2}(t) = n_{vc2} + jn_{vs2}$. From
Eq. (21), the effect of phase noise on the desired signal is converted to white noise,
and the distribution of the white noise has not changed. Using the minimum
variance criterion, we can then obtain the polarization state of the desired signal,
which can be written as Eq. (22).

$$\hat{P}_s = [\cos(\varepsilon_s),\ \sin(\varepsilon_s)e^{j\delta_s}]^{T}, \tag{22}$$


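$\hat{P}_s$ is the polarization of the desired signal estimated at the receiver. After the
MR, the mixed signal becomes Eq. (23).

$$\begin{aligned}
y_{LR1}(t) &= \hat{P}_s^{H} \cdot y_{LR}(t) \\
&= [\cos(\varepsilon_s)\ \ \sin(\varepsilon_s)e^{-j\delta_s}] \left\{ \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} + \begin{bmatrix} N_{hL2}(t) \\ N_{vL2}(t) \end{bmatrix} \right\} \\
&= [\cos(\varepsilon_s)\ \ \sin(\varepsilon_s)e^{-j\delta_s}] \begin{bmatrix} E_{hs} \\ E_{vs}e^{j\delta_s} \end{bmatrix} + [\cos(\varepsilon_s)\ \ \sin(\varepsilon_s)e^{-j\delta_s}] \begin{bmatrix} N_{hL2}(t) \\ N_{vL2}(t) \end{bmatrix} \\
&= s(t) + n_0(t),
\end{aligned} \tag{23}$$

In Eq. (23), $y_{LR1}(t)$ is the mixed signal after MR, $\hat{P}_s^{H}$ is the Hermitian
transpose of $\hat{P}_s$, and $n_0(t)$ is the white noise after MR:

$$n_0(t) = [\cos(\varepsilon_s)\ \ \sin(\varepsilon_s)e^{-j\delta_s}] \begin{bmatrix} N_{hL2}(t) \\ N_{vL2}(t) \end{bmatrix}. \tag{24}$$

After the MR, the desired signal $s(t)$ has been recovered. From Eq. (23), the
phase noise caused by the local oscillators of the transmitter and receiver is canceled.
The white noise is fully polarized after the MR, so only half of the white-noise
power remains. Meanwhile, the amount of self-interference cancelation
is calculated as Eq. (25).

$$\eta_{SINR}(\mathrm{dB}) = 10\log\left(\frac{SINR_{out}}{SINR_{in}}\right), \tag{25}$$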





$\eta_{SINR}$ (dB) expresses the amount of self-interference cancelation. $SINR_{in}$ and
$SINR_{out}$ express the ratio of the desired signal power to the power of the self-interference plus
white noise before and after signal processing at the receiver. Eq. (25)
can then be written as Eq. (26).








NSINR (dB) = 10log 2 -+ b’ (t) j : (26) 


g2 





Comparing Eq. (26) with the amount of self-interference cancelation in
[7,9,10,12], the phase noise caused by the local oscillators of the transmitter and the
receiver is canceled, and the amount of self-interference cancelation is improved.
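
As a small worked illustration of Eq. (25), the sketch below computes the amount of cancelation in dB from the SINR before and after processing. The helper names and the power values are hypothetical; only the 10 log ratio itself comes from the text.

```python
import math


def sinr(signal_power, interference_power, noise_power):
    """Ratio of desired signal power to self-interference plus noise power."""
    return signal_power / (interference_power + noise_power)


def cancelation_db(sinr_in, sinr_out):
    """Amount of self-interference cancelation in dB, as in Eq. (25)."""
    return 10.0 * math.log10(sinr_out / sinr_in)


# Hypothetical powers: interference far above the desired signal before
# processing, interference removed and noise power halved afterwards.
sinr_before = sinr(1.0, 1.0e9, 1.0e-3)
sinr_after = sinr(1.0, 0.0, 0.5e-3)
eta = cancelation_db(sinr_before, sinr_after)
```

Because the output SINR in this sketch no longer contains any interference term, the result depends only on the interference and noise powers that were assumed.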





4 Simulation Scenarios, Results and Analysis 


Co-simulation between MATLAB and Advanced Design System (ADS) is used
to verify the performance of the self-interference cancelation method proposed in this
paper. This part shows the effect of phase noise on the signal and illustrates
intuitively how that effect is canceled.

In this section, the ratio between the H and V branches is set to 1 and the phase
difference is set to 45 degrees. The variance of the phase noise is set to 0.018
(phase in radians). The effect of phase noise on the self-interference is shown in Figs. 3 and 4.
The spectrum of the self-interference without phase noise is shown in Fig. 3,
and Fig. 4 shows the spectrum of the self-interference affected by phase noise.
Comparing Figs. 3 and 4, we can see that the spectrum of the self-interference
is spread by the phase noise. The power at the center frequency of the
self-interference decreases, while the total power of the signal is unchanged.
The effect of phase noise on the desired signal is the same as on the self-interference.
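
The spectral observation above can be reproduced with a minimal sketch. It assumes, as a simplification that the paper does not state, that the phase noise samples are independent zero-mean Gaussian values applied to a unit-amplitude baseband carrier. The function names and the sample count are hypothetical.

```python
import cmath
import random


def carrier_with_phase_noise(n, phase_std, seed=0):
    """Unit-amplitude baseband carrier with i.i.d. Gaussian phase noise."""
    rng = random.Random(seed)
    return [cmath.exp(1j * rng.gauss(0.0, phase_std)) for _ in range(n)]


def total_power(x):
    """Average power over all samples."""
    return sum(abs(v) ** 2 for v in x) / len(x)


def center_power(x):
    """Power in the zero-frequency (carrier) bin of a baseband signal."""
    return abs(sum(x) / len(x)) ** 2


clean = carrier_with_phase_noise(2000, 0.0)
noisy = carrier_with_phase_noise(2000, 0.5)
```

For these two signals the total power is the same, because a phase rotation does not change the magnitude of a sample, while the carrier-bin power of `noisy` is lower than that of `clean`: the missing power has moved to other frequencies.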

The waveform of the mixed signal at the receiver in the time domain is shown in
Fig. 5(a). The self-interference is 90 dB higher than the desired signal. Because of
the effect of the self-interference, phase noise, and white noise, the desired signal
is seriously distorted. After the first cancelation step, the self-interference
is canceled, and its phase noise is converted to the desired signal and the white noise,
as shown in Fig. 5(b). After the second cancelation step, the phase noise of
the desired signal is converted to white noise, as shown in Fig. 5(c). After the
two steps, the effect of phase noise on self-interference cancelation is removed
and the desired signal is recovered.

Fig. 3. The spectrum of self-interference without phase noise

Fig. 4. The spectrum of self-interference affected by phase noise








Fig. 5. Waveform of the mixed signal before cancelation (a), after the first cancelation
(b), after the second cancelation (c)

Fig. 6. The cancelation affected by phase noise (Color figure online)





Figure 6 shows the effect of phase noise on the self-interference cancelation.
In Fig. 6, the abscissa is the standard deviation of the phase noise and the
ordinate is the amount of cancelation in dB. The signal-to-interference-plus-noise
ratio (SINR) is set to 90 dB. The blue (rhombus) line shows the relationship
between the amount of self-interference cancelation and the standard deviation
of the phase noise without any suppression method [7]; the amount of cancelation
decreases as the power of the phase noise increases. The green (rectangle) and pink
(pentagram) lines show the amount of cancelation affected by phase noise after the
traditional self-interference elimination algorithms [11,12]. Although the amount
of cancelation is higher than without suppression, it still decreases
as the phase noise increases. The red (triangle) line shows the amount of
self-interference cancelation using the method proposed in this paper. The black
(circle) line shows the amount of self-interference cancelation simulated
in ADS. These two lines suggest that, with the cancelation algorithm
proposed in this paper, the amount of cancelation remains unchanged as the phase
noise increases. We conclude that the effect of phase noise on the cancelation
of self-interference is removed by the two-step algorithm
proposed in this paper.




















5 Conclusion 


Based on the rotation characteristics of unitary matrices in the polarization domain, the
method in this paper cancels the effect of phase noise on self-interference
cancelation and on the recovery of the desired signal. The method converts
the multiplicative noise to additive noise with polarization signal
processing in two steps. The first step converts the effect of phase noise on
the self-interference to the desired signal and the white noise using Stokes vector
processing in the polarization domain, and then uses the reconstructed signal to cancel
the self-interference signal. The second step recovers the desired signal using
the same principle. Theoretical analysis and numerical simulation show that the effect
of phase noise on self-interference cancelation is canceled, and the amount of
cancelation is improved while the desired signal is recovered.


Abstract. In this paper, we propose a course recommender system
for the information management specialty in China. We collect
course enrollment data for a specific set of students. The sparse linear
method (SLIM) is introduced in our approach to generate the top-N
recommendations of courses for students. Furthermore, $L_0$ regularization terms are
introduced into our proposed optimization method based on the observation of the
entries in the recommendation system matrix. Comparative experiments between
state-of-the-art methods and our method, judged against expert knowledge, are conducted to
evaluate the performance of our method. Experimental results show that our
proposed method outperforms state-of-the-art methods in both accuracy and
efficiency.


Keywords: Course recommender system · Sparse linear method · Expert knowledge


1 Introduction 


The emergence and rapid development of the Internet have greatly affected the traditional
way of choosing courses by providing detailed course information. As the number
of courses conforming to students' needs has increased tremendously, the problem
has become how to determine the courses most suitable for each student
accurately and efficiently. A plethora of methods and algorithms [2, 3, 11, 15] for course
recommendation have been proposed to deal with this problem. Most of the methods
designed for recommendation systems can be grouped into three categories:
collaborative [1, 8], content-based [7, 14], and knowledge-based [5, 8, 17]. They have
been applied in different fields. For example, [4] applied collaborative filtering embedded
with an artificial immune system to course recommendation for college students,
and the ratings from professors were used as ground truth to examine the results.
Inspired by the idea from [4] and the optimization framework in [9], we propose a
sparse linear method for top-N course recommendation with expert knowledge
as the ground truth. This method extracts the coefficient matrix for the courses in the
recommender system from the student/course matrix by solving a regularized
optimization problem. Sparseness is exploited to represent the sparse characteristics of the
recommendation coefficient matrix. The sparse linear method (SLIM) [9] was proposed for
top-N recommender systems, but it has rarely been exploited in course recommender systems.
Due to the characteristics of course recommendation systems in Chinese universities, our
method focuses on accuracy more than efficiency. This differs from the
previously proposed SLIM-based methods [6, 9, 10, 18], which mainly address
real-time applications of top-N recommender systems. The framework of our proposed
course recommender system is shown in Fig. 1.


[Flowchart boxes: Learning Management System; Select relevant data; Data Gathering; Data Pre-Processing; Use SLIM; Result of the Recommender System]

Fig. 1. The framework of our proposed course recommender system


According to our observation of common recommendation system matrices, most
of the entries are assigned the same value (zero or one), and the gradients of neighboring
entries also hold the same value (zero or one). Therefore, the sparse counting strategy
of $L_0$ regularization terms [16] was included in the optimization framework of SLIM.
The $L_0$ terms can globally constrain the non-zero values of the entries and the gradients in
the recommendation system matrix, which is the main contribution of our proposed
method. Different from the previously proposed regularization terms (the $L_1$ and $L_2$
terms), the $L_0$ term can maintain the subtle relationship between the entries in the
recommendation system matrix.

After the process of data gathering shown in Fig. 1, comparative experiments
between state-of-the-art methods and our method are conducted. The
experimental results of both the state-of-the-art methods and our method are then evaluated against the
course recommendations given by seven experts with a voting strategy.

The rest of the paper is organized as follows. In Sect. 2, we describe the details of 
our proposed method. In Sect. 3 the dataset that we used in our experiments and the 
experimental results are presented. In Sect. 4 the discussion and conclusion are given. 


2 Our Method 


2.1 The Formation of the Method 


In the following, $t_j$ and $s_i$ denote a course and a student
in the course recommender system, respectively. The whole student-course enrollment record is
represented by a matrix $A$ of size $m \times n$, in which each entry is 1 or 0 (1 denotes that the
student has taken the course, and 0 that the student has not).

In this paper, we introduce the Sparse Linear Method (SLIM) to implement top-N
course recommendation. In this approach, the recommendation score of each
un-taken course $t_j$ for a student $s_i$ is computed as a sparse aggregation of the
courses that have been taken by $s_i$, as shown in Eq. (1).

$$\bar{a}_{ij} = a_i^{T} w_j, \tag{1}$$

where $a_i$ is the initial course selection of a specific student and $w_j$ is the sparse vector of
aggregation coefficients. The SLIM model in matrix form is represented as:

$$\bar{A} = AW, \tag{2}$$

where $\bar{A}$ is the initial value of the student/course matrix, $A$ denotes the latent
binary student-course matrix, $W$ denotes the $n \times n$ sparse matrix of aggregation
coefficients, whose $j$-th column corresponds to $w_j$ in Eq. (1), and each row of
$C$ ($c_i$) contains the course recommendation scores on all courses for student $s_i$. The final course
recommendation for each student is obtained by sorting the non-taken
courses in decreasing order of score and recommending the top-N courses in the sequence.
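
The scoring and sorting step described above can be sketched in a few lines. The sketch assumes the coefficient matrix $W$ has already been learned; the function name, the list-of-rows layout of `W`, and the numeric coefficients are hypothetical and serve only to illustrate Eq. (1) followed by top-N selection.

```python
def recommend_top_n(taken, W, n):
    """Return the indices of the top-n un-taken courses for one student.

    taken: list of 0/1 flags, one per course (a row of the student/course matrix).
    W:     square coefficient matrix as a list of rows; W[k][j] is the weight
           of taken course k when scoring course j (column j is w_j).
    """
    num_courses = len(taken)
    # Score of course j is the inner product of the student's row with w_j.
    scores = [sum(taken[k] * W[k][j] for k in range(num_courses))
              for j in range(num_courses)]
    candidates = [j for j in range(num_courses) if taken[j] == 0]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:n]


# Illustrative values only: four courses, the student has taken courses 0 and 1.
W = [
    [0.0, 0.1, 0.6, 0.2],
    [0.1, 0.0, 0.3, 0.0],
    [0.5, 0.2, 0.0, 0.1],
    [0.2, 0.0, 0.1, 0.0],
]
top = recommend_top_n([1, 1, 0, 0], W, 1)
```

With these numbers course 2 scores 0.9 and course 3 scores 0.2, so course 2 is recommended first.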

In our method, the initial student/course matrix is extracted from the learning
management system of a specific University in China. With the extracted student/course
matrix of size $m \times n$, the sparse matrix $W$ of size $n \times n$ in Eq. (2) is iteratively optimized
by the alternate minimization method. The objective function previously
proposed in [9] is shown in Eq. (3), and that of our proposed method is shown in Eq. (4).

$$\min_{W} \frac{1}{2}\|\bar{A} - AW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1, \tag{3}$$

$$\min_{W \ge 0} \frac{1}{2}\|\bar{A} - AW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda_1\|A\|_0 + \mu\|\nabla A\|_0, \tag{4}$$


where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\|W\|_1$ is the item-wise $L_1$ norm, and
$\|\cdot\|_0$ denotes the entry-wise $L_0$ norm, which counts the number of non-zero
entries. The data term $\|\bar{A} - AW\|_F^2$ is exploited to measure the difference between the
calculated model and the training dataset. The $L_F$ norm, $L_1$ norm, and $L_0$ norm
are exploited to regularize the entries of the coefficient matrix $W$, $A$, and $\nabla A$, respectively.
The parameters $\lambda$, $\beta$, $\lambda_1$, and $\mu$ weight the regularization terms
in the objective functions.

In our proposed final objective function, the $L_F$ norm is introduced to transfer the
optimization problem into an elastic net problem [19], which prevents potential
overfitting. Moreover, the $L_1$ norm in Eq. (3) is changed to the $L_0$ norm in our proposed objective
function. This $L_0$ norm [12, 13, 16] is introduced to constrain the sparseness of
$A$ and $\nabla A$.

Because the columns of the matrix $W$ are independent, the final objective function in
Eq. (4) is decoupled into a set of objective functions as follows:

$$\min_{w_j \ge 0} \frac{1}{2}\|\bar{a}_j - Aw_j\|_2^2 + \frac{\beta}{2}\|w_j\|_2^2 + \lambda_1\|a_j\|_0 + \mu\|\nabla a_j\|_0, \tag{5}$$

where $\bar{a}_j$ and $a_j$ are the $j$-th columns of the matrices $\bar{A}$ and $A$, and $w_j$ denotes the $j$-th column of the matrix $W$. Each
instance of Eq. (5) has two unknown variables, which makes it a typical ill-posed problem. Thus,
the problem needs to be solved by the alternate minimization method. In each iteration, one
of the two variables is fixed and the other variable is optimized.


2.2 The Solver of Our Proposed Method 


Sub-problem 1: computing $w_j$
The $w_j$ computation sub-problem is represented by the minimization of Eq. (6):

$$\frac{1}{2}\|\bar{a}_j - Aw_j\|_2^2 + \frac{\beta}{2}\|w_j\|_2^2. \tag{6}$$

After eliminating the $L_0$ terms in Eq. (5), the function in Eq. (6) has a global
minimum, which can be computed by gradient descent. The analytical solution to
Eq. (6) is shown in Eq. (7):


F(a; 
C » 
+ > (F(A) -F(0,) +E) > F(4)) 
where $F(\cdot)$ and $F^{-1}(\cdot)$ denote the Fast Fourier Transform (FFT) and the inverse FFT,
respectively, and $F(\cdot)^{*}$ is the complex conjugate of $F(\cdot)$.
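
The text notes that the quadratic sub-problem for $w_j$ has a global minimum that gradient descent can reach. The sketch below illustrates that route with plain projected gradient descent rather than the FFT-based closed form. It assumes the sub-problem is the ridge-type objective obtained by dropping the $L_0$ terms, with the non-negativity constraint on $w_j$; the function name, step size, and iteration count are hypothetical.

```python
def solve_w(A, a_bar, beta, step=0.1, iters=2000):
    """Projected gradient descent for
        0.5 * ||a_bar - A w||^2 + 0.5 * beta * ||w||^2   subject to w >= 0.

    A is a list of rows (m x n), a_bar a list of length m.
    The step size must be small enough for the given A; 0.1 is only an
    illustrative default.
    """
    m, n = len(A), len(A[0])
    w = [0.0] * n
    for _ in range(iters):
        residual = [sum(A[i][k] * w[k] for k in range(n)) - a_bar[i]
                    for i in range(m)]
        grad = [sum(A[i][k] * residual[i] for i in range(m)) + beta * w[k]
                for k in range(n)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[k] - step * grad[k]) for k in range(n)]
    return w


# Tiny illustrative problem: with A = I the minimizer is a_bar / (1 + beta).
w_example = solve_w([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], beta=1.0)
```

For the identity matrix used here the exact minimizer is (0.5, 0), which the iteration approaches geometrically.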


Sub-problem 2: computing $a_j$ and $\nabla a_j$

With the intermediate outcome of $w_j$, $a_j$ and $\nabla a_j$ can be computed by Eq. (8):

$$\frac{1}{2}\|\bar{a}_j - Aw_j\|_2^2 + \lambda_1\|a_j\|_0 + \mu\|\nabla a_j\|_0. \tag{8}$$

By introducing two auxiliary variables $h$ and $v$ corresponding to the column vectors
$a_j$ and $\nabla a_j$, the sub-problem can be transformed into Eq. (9):


Sle- alfi + Ala; = all, + a| |Va; — vf, + ACh + lvl) 6) 


To verify the performance of our proposed method, comparative experiments between
state-of-the-art methods and our method are carried out with the gathered dataset and expert
knowledge. In the following section, the experiments are described in detail.


3 Experimental Results 


3.1 Datasets 


To verify the performance of our proposed method and apply it
in practical scenarios, we gathered data on five classes of the information management
specialty from the learning management system of our University. The data records of the
courses and students were extracted from the Department of Management Information
System, Shandong University of Finance and Economics and the Department of
Electronic Engineering Information Technology at Shandong University of Sci&Tech. The
most important information about the courses and students is the grades
corresponding to the courses. All of the students from the information management
specialty are freshmen in our University. Most of them have taken the first-year courses
in their curriculum, except for three students who failed to advance to the next grade.
We therefore first eliminate the records of these three students. Meanwhile, we use a
questionnaire to collect the knowledge that the students have mastered, including programming skills.
The courses that they have taken and the content that they have grasped are
combined in the final dataset. A part of the dataset is shown in Table 1, where 1 denotes
that student $s_i$ has mastered course $t_j$, and 0 denotes the opposite.

Table 1. The initial dataset from the five classes

After gathering the data of the students from the five classes, comparative experiments
between state-of-the-art methods and our method are conducted. We choose several
state-of-the-art methods, including the collaborative filtering methods itemkNN and userkNN
and the matrix factorization method PureSVD.


3.2 Measurement 


The knowledge of several experts on the courses in the information management
specialty is adopted as ground truth in the experiments. To measure the
performance of the compared methods, we use the Hit Rate (HR) and the Average
Reciprocal Hit-Rank (ARHR), which are defined in
Eqs. (11) and (12).

$$HR = \frac{\#hits}{\#students}, \tag{11}$$

where #hits denotes the number of students whose course in the testing set is also
recommended by the expert, and #students denotes the number of all students in the dataset.

$$ARHR = \frac{1}{\#students}\sum_{i=1}^{\#hits}\frac{1}{p_i}, \tag{12}$$

where $p_i$ is the position of the $i$-th hit in the ordered recommendation list.
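
The two measures can be computed directly from ranked recommendation lists. The sketch below assumes one expert-chosen course per student; the function name, the dictionary layout, and the sample data are hypothetical.

```python
def hit_rate_and_arhr(recommendations, expert_choice):
    """Compute (HR, ARHR) as in Eqs. (11) and (12).

    recommendations: dict mapping student -> ordered list of recommended courses.
    expert_choice:   dict mapping student -> the course recommended by the expert.
    """
    hits = 0
    reciprocal_sum = 0.0
    for student, ranked in recommendations.items():
        target = expert_choice[student]
        if target in ranked:
            hits += 1
            # Positions are 1-based: a hit at the top of the list counts 1.
            reciprocal_sum += 1.0 / (ranked.index(target) + 1)
    num_students = len(recommendations)
    return hits / num_students, reciprocal_sum / num_students


# Illustrative data: hits at positions 1 and 2, and one miss.
recs = {"s1": ["c1", "c2"], "s2": ["c3", "c4"], "s3": ["c5"]}
truth = {"s1": "c1", "s2": "c4", "s3": "c9"}
hr, arhr = hit_rate_and_arhr(recs, truth)
```

In this example HR is 2/3 and ARHR is (1 + 1/2)/3 = 0.5, which shows how ARHR rewards hits that appear earlier in the list.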


3.3 Experimental Results 


In this section, we present the experimental results calculated from the practical dataset. Table 2
shows the experimental results of the compared methods in top-N course
recommendation.

Table 2. The performance of the comparing methods


N 
b 
D 


SLIM 0.15 
ours 0.15 





Here $HR_i$ and $ARHR_i$ denote the performance for class $i$. The experimental
results shown in Table 2 demonstrate that our proposed method outperforms
state-of-the-art methods in most course recommendations in both HR and ARHR.
This shows that the sparse regularization term, based on the prior knowledge from
our observation, is suitable for solving the problem of course recommendation.


To illustrate how the performance of our proposed method depends on the
number of courses and topics included in the experimental testing, Fig. 2 shows
that a higher accuracy is obtained when the number of courses increases. Meanwhile,
the courses included in our experiments are divided into 32 different topics; Fig. 3 shows
that the accuracy is also higher when there are more related courses.

Fig. 2. Accuracy of our proposed recommendation system method versus the number of courses

Fig. 3. Accuracy of our proposed recommendation system method versus the number of topics


4 Conclusion 


In this paper, we propose an approach to course recommendation. In our method,
SLIM is introduced and a novel $L_0$ regularization term is exploited in SLIM.
Meanwhile, the alternate minimization strategy is exploited to optimize the outcome of our
method. To verify the performance of our method, comparative experiments between
state-of-the-art methods and our method are conducted on students
from five different classes. The experimental results show that our method outperforms the other
previously proposed methods.

The proposed method is mainly used to implement course recommendation
for universities in China. However, it can also be exploited in other related fields.
In the future, more applications of our approach will be investigated. Other future
work includes modifying the objective function in our method with
other regularization terms and different optimization strategies.


Abstract. This paper studies a Container Trucks (CTs) dynamic dispatching
strategy in which multiple tasks are matched with multiple CTs, and proposes
Multi-Agent contract net protocols based on a two-way negotiation mechanism. It
can assign optimal tasks to CTs through tendering, bidding, and contracting of
tasks. Fuzzy set theory and method is adopted to couple multiple factors on the
tasks to assess their dispatching emergency degree in the process. One of the
factors is the distance from the current position of a CT to the loading/unloading
position. By judging the obstruction status using related data from the GPRS
system, the steps of an ant colony algorithm are designed to find the predicted travel
distance of the optimal route. After fuzzification and defuzzification in MATLAB,
the decision query table of the dispatching plan is obtained. Finally, a case study is
given to describe the scheduling scheme in detail.


Keywords: Container truck dynamic scheduling · Multi-agent system ·
Fuzzy control · Ant colony algorithm · Path optimization


1 Introduction 


The loading/unloading equipment at most container terminals in China consists mainly of Quay
Cranes (QCs), Yard Cranes (YCs), and CTs. The scheduling system of CTs is well known
to have the characteristics of a complex logistics system, because many factors affect
scheduling, such as time, space, resources, and uncertain factors. Optimizing the
scheduling system is meaningful because it can improve the efficiency of loading and unloading.

The physical entity resource network and the control decision-making information
network were integrated into the modeling and optimization architecture using the Harvard
architecture and Agent-based computing [1, 2]. In [3], the relationship between transport
tasks and the service of CTs was modeled as a contract net using fuzzy set theory and
method. A dispatching model based on the Contract Network Protocol (CNP) with
bidirectional negotiation was provided, and a fuzzy reasoning process for dispatching decisions was
suggested. However, the distance from the current position of a CT to the loading/unloading
position was calculated from the established coordinate system. In the case of road
congestion, a CT should choose an unobstructed route, so the distance should be the length of that route.
Recently, path planning has become a hot research area [4, 5]; the main algorithms
used are the genetic algorithm, the ant colony algorithm, and so on.

This paper proposes Multi-Agent contract net protocols based on a two-way
negotiation mechanism. A fuzzy control system is designed to couple multiple factors on the
tasks to assess their dispatching emergency degree. The steps of an ant colony algorithm
that considers the obstruction status are designed to find the optimal route. After
fuzzification and defuzzification, the decision query table of the dispatching plan is obtained.


2 MAS Framework of Container Terminal Scheduling 


A container terminal consists of containers, ships, handling equipment (QC, YC, and CT),
communication equipment, berths, container yards, human resources, etc. Superior
and subordinate subsystems have command and obedience relationships, and parallel
subsystems have collaborative, consultative, and competitive relationships. Figure 1 shows
the hierarchical structure of the whole system.

BA-Berth Assigning; QCS-Quay Cranes Scheduling; CTS-Container Trucks
Scheduling; YCS-Yard Cranes Scheduling; YA-Yard Allocating

Fig. 1. Framework of container terminal schedule system based on Multi-Agent

As we can see, there are three types of individual agents, namely Control Agent
(fixed), Execution Agent (moving with wireless mobile terminal), and Operation
(Mobile) Agent (dynamically generated).

Operation Agents are created dynamically and freely by the Central Processing
Agent throughout the whole system, just like computer software; they are assigned
by the Control Agent and executed by the Execution Agent.


3 Dynamic Dispatching Strategy 


In the traditional static scheduling mode, the truck travels unloaded for half of the distance. A
dynamic dispatching strategy in which multiple tasks are matched with multiple CTs is
adopted in this study, which guarantees that an idle truck can quickly be sent
to the operation point that needs trucks. So how to arrange a truck for each task should
be considered. In this paper, considering the communication load and the negotiation
efficiency of the system, Multi-Agent contract net protocols based on a two-way negotiation
mechanism are adopted to achieve the best arrangement. Figure 2 shows the contract
net.
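
The announce, bid, and award cycle of a contract net can be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol implementation of the paper: each bid is assumed to carry a single numeric score (standing in for the dispatching emergency degree that the paper evaluates with fuzzy reasoning), and awards are made greedily so that each task and each truck is used at most once. The class, function, and identifier names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    truck_id: str
    task_id: str
    emergency_degree: float  # hypothetical score attached to the bid


def award(tasks, bids):
    """Greedy award: highest-scoring bids first, one truck per task."""
    awarded = {}
    busy_trucks = set()
    for bid in sorted(bids, key=lambda b: b.emergency_degree, reverse=True):
        if (bid.task_id in tasks
                and bid.task_id not in awarded
                and bid.truck_id not in busy_trucks):
            awarded[bid.task_id] = bid.truck_id
            busy_trucks.add(bid.truck_id)
    return awarded


# Two announced tasks and the bids returned by two idle trucks.
tasks = {"T1", "T2"}
bids = [
    Bid("CT1", "T1", 0.9),
    Bid("CT1", "T2", 0.8),
    Bid("CT2", "T1", 0.7),
    Bid("CT2", "T2", 0.4),
]
contracts = award(tasks, bids)
```

Here CT1 wins T1 with the highest score, which leaves T2 to CT2 even though CT1 also bid higher on it. A greedy award is only one possible policy; a two-way negotiation could also let trucks decline or re-bid.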


Control Agent Execution Agent 





Fig. 2. Contract net 


In the contract negotiation protocol of the Agent network, different types of equipment
collaborate through task requests and responses, using announcing, bidding, and
awarding in the contract net. In evaluating tenders, fuzzy set theory and method is adopted
to couple multiple factors on the tasks to assess their dispatching emergency degree.
The higher the evaluated dispatching emergency degree, the greater the probability of selecting the
CT.


4 Fuzzy Set Theory and Method


The structure of the fuzzy reasoning controller is shown in Fig. 3. It has two inputs, the
predicted travel distance and the priority of the handling operation; the output is the
dispatching emergency degree. The dispatching decision knowledge set consists of n subsets,
R1, R2, ..., Rn. Each subset contains some dispatching decision knowledge, described in
matrix form according to certain rules. The input information is taken from the bidding
document and the database of the Agents. The interpretation module determines, from the
input linguistic variables, their values in the corresponding fuzzy sets A1, A2. The fuzzy
reasoning module uses the expertise subsets R1, R2, ..., Rn and the information from the
synthesis module A, and applies fuzzy relation composition operations to generate the
output result, which is also expressed as fuzzy sets. The linguistic matching function of
the fuzzy decision module translates the result into output language; that is, the result of
the reasoning is converted into an exact value of the task dispatching emergency degree,
to which Agents can refer when bidding.

Fig. 3. Model of fuzzy assessment for distributing tasks


Heavy loads and the confined space of the dock make trucks more likely to be involved in
traffic accidents, which lead to road congestion. The travel time and distance from the
current position of a CT to the loading/unloading position therefore cannot simply be
calculated along the shortest route. Instead, the obstruction status is judged first, and
the steps of an ant colony algorithm are designed to find the optimal route, from which the
predicted travel distance and time are calculated.


4.1 The Ant Colony Algorithm Design for the Optimal Route of Container Truck


Definition and Control of Congestion. A traffic flow model is a mathematical equation
expressing the correlation between traffic parameters such as speed, density, and flux.
The obstruction status can be judged from the related data gathered by GPRS. Take the
spur track i-j as an example.

$$u_{ij} = u_{ijf}\,(1 - k_{ij}/k_{ij0})$$

$$q_{ij} = k_{ij}\,u_{ij}$$

$u_{ij}$: The vehicle speed
$u_{ijf}$: The free driving speed
$k_{ij}$: The vehicle density
$k_{ij0}$: The jam density
$q_{ij}$: The traffic flux

Suppose the average minimum distance headway at the congestion density is $h_{ijs}$ and
the average minimum time headway is $h_{ijt}$; then

$$k_{ij0} = 1000/h_{ijs}, \quad q_{ijm} = 3600/h_{ijt}, \quad k_{ijm} = k_{ij0}/2, \quad u_{ijm} = q_{ijm}/k_{ijm}$$

$q_{ijm}$: The maximum traffic capability
$k_{ijm}$: The corresponding density
$u_{ijm}$: Critical speed of the congestion

The judging procedure for congestion on the spur track is as follows:


Step 1: If $q_{ij} > q_{ijm}$, go to step 3; otherwise, go to step 2.

Step 2: If $u_{ij} < u_{ijm}$, go to step 3; otherwise, the spur track is not congested.

Step 3: The spur track is congested, and the jam knots are recorded in the
congestion table Ck.
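
The judging procedure above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the names SpurTrackData and isCongested are hypothetical, and the units (headway distance in metres per vehicle, time headway in seconds per vehicle, speed in km/h, flux in vehicles per hour) are inferred from the constants 1000 and 3600 rather than stated in the paper.

```cpp
#include <cassert>

// Hypothetical container for the per-track data gathered by GPRS.
struct SpurTrackData {
    double speed;            // u_ij, km/h (assumed unit)
    double flux;             // q_ij, vehicles/h (assumed unit)
    double distanceHeadway;  // h_ijs, m per vehicle at jam density (assumed unit)
    double timeHeadway;      // h_ijt, s per vehicle (assumed unit)
};

// Returns true when the spur track is judged congested:
// flux above capacity (step 1) or speed below the critical speed (step 2).
bool isCongested(const SpurTrackData& d) {
    assert(d.distanceHeadway > 0.0 && d.timeHeadway > 0.0);
    const double jamDensity = 1000.0 / d.distanceHeadway;     // k_ij0
    const double capacity = 3600.0 / d.timeHeadway;           // q_ijm
    const double criticalDensity = jamDensity / 2.0;          // k_ijm
    const double criticalSpeed = capacity / criticalDensity;  // u_ijm
    if (d.flux > capacity) {
        return true;  // step 1 leads directly to step 3
    }
    return d.speed < criticalSpeed;  // step 2
}
```

With a distance headway of 10 m and a time headway of 4 s, for instance, the capacity is 900 vehicles/h and the critical speed is 18 km/h, so a track running at 10 km/h would be recorded in Ck.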


Confirming Principle of the Allowed Knot Set allowedk. To reflect the actual layout of the
dock, the containers are abstracted into grid cells. The positions of the ants are
controlled dynamically by modifying the relation matrices, using coordinates to identify
the direct pre-knot or sub-knot.

For ant k at knot i, the allowed knot set allowedk equals the set of all n knots minus Ck
and Pk, where n is the total number of knots and Pk is the set of knots already passed by
ant k. In keeping with the traffic rules of container terminals, no knot is visited more
than once.

The Steps of Ant Colony Algorithm of Optimal Route for Container Trucks. 


Step 1: Initialization. Set the starting point and the ending point, and set the maximum
number of cycles Nmax. Initialize the user-defined control parameters α, β, Q.
Initially, m ants are placed randomly on n nodes, and the amount of
information on each path is equal. The data gathered by GPRS for every spur
track are: uij, hijs, hijt, qij.

Step 2: Put the initial knot of ant k into Pk and search until the target knot is found.
If it is found, end the loop and go to step 3. Otherwise, determine allowedk,
work out the transfer probability, then move to the next knot and keep searching.


The transfer probability of ant k from knot i to knot j can be defined as:


$$p_{ij}^{k} = \begin{cases} \dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{s \in allowed_k} \tau_{is}^{\alpha}\,\eta_{is}^{\beta}}, & j \in allowed_k \\ 0, & \text{else} \end{cases}$$

Step 3: End a loop iteration when the last ant finds the target knot, and record the
route length Lk.

Step 4: Update the pheromone intensity.


If the arrival time is between t and t + 1, the pheromone intensity is updated
according to the following formulas:


$$\tau_{ij}(t + 1) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij}$$

$$\Delta\tau_{ij} = \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, \qquad \Delta\tau_{ij}^{k} = \begin{cases} Q/L_k, & (i, j) \in T_k(t) \\ 0, & \text{else} \end{cases}$$

Q is a constant denoting the pheromone intensity. Lk is the total route length covered
by ant k in the current iteration, and Tk(t) is the route it travelled. ρ is the residual
coefficient of the pheromone; it should be less than 1 to avoid the endless accumulation
of pheromone.


Step 5: Compare all the routes travelled by the ants; the route with the minimum length
is the optimal route. Output the result and terminate the algorithm.
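
The two numerical kernels of steps 2 and 4, the transfer probability and the pheromone update, can be sketched in C++ as follows. This is a minimal sketch rather than the authors' implementation: the function and type names are hypothetical, and the heuristic value η is passed in as plain data because the text does not define it (in ant colony methods it is commonly the inverse of the edge length).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Transfer probabilities from the current knot i to every knot j for one ant.
// tau[j] and eta[j] belong to edge (i, j); allowed[j] is true when j is in allowed_k.
std::vector<double> transferProbabilities(const std::vector<double>& tau,
                                          const std::vector<double>& eta,
                                          const std::vector<bool>& allowed,
                                          double alpha, double beta) {
    assert(tau.size() == eta.size() && tau.size() == allowed.size());
    std::vector<double> p(tau.size(), 0.0);
    double sum = 0.0;
    for (std::size_t j = 0; j < tau.size(); ++j) {
        if (allowed[j]) {
            p[j] = std::pow(tau[j], alpha) * std::pow(eta[j], beta);
            sum += p[j];
        }
    }
    if (sum > 0.0) {
        for (double& v : p) v /= sum;  // normalise over allowed_k
    }
    return p;
}

using Edge = std::pair<std::size_t, std::size_t>;

// Hypothetical record of one ant's route in the current iteration.
struct AntRoute {
    std::vector<Edge> edges;  // edges (i, j) travelled by ant k
    double length;            // L_k
};

// Applies tau(t+1) = rho * tau(t) + sum over ants of Q / L_k on travelled edges.
// tau is an n x n pheromone matrix that is updated in place.
void updatePheromone(std::vector<std::vector<double>>& tau,
                     const std::vector<AntRoute>& routes,
                     double rho, double Q) {
    assert(rho > 0.0 && rho < 1.0);
    for (std::vector<double>& row : tau) {
        for (double& t : row) t *= rho;  // residual pheromone
    }
    for (const AntRoute& r : routes) {
        assert(r.length > 0.0);
        for (const Edge& e : r.edges) {
            tau[e.first][e.second] += Q / r.length;  // deposit on travelled edges
        }
    }
}
```

Knots in the congestion table Ck or already in Pk are simply marked false in allowed, so they receive probability 0 and are never selected.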


Case Analysis. The optimal route and the shortest route are shown in Figs. 4 and 5,
respectively.

The speed on congested spur tracks is 10 km/h, and the speed on the other spur tracks
is 25 km/h. Each grid cell in the map is 50 m long, and an oblique line is simplified
into two straight segments. The travel time and distance from the current position of the
CT to the loading/unloading position are calculated. The optimal route is 1.25 km long
and takes 0.05 h, whereas the shortest route is 0.95 km long and takes 0.06 h. The ant
colony algorithm thus helps to find the optimal route when parts of the network are
congested. Studying the optimal route helps in guiding container trucks so as to avoid or
relieve traffic congestion.
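
The reported figures can be checked with a short calculation, assuming that the optimal route runs entirely over uncongested spur tracks at 25 km/h (an assumption that is consistent with the reported 0.05 h but is not stated explicitly):

$$t_{\text{optimal}} = \frac{1.25\ \text{km}}{25\ \text{km/h}} = 0.05\ \text{h}, \qquad t_{\text{shortest, uncongested}} = \frac{0.95\ \text{km}}{25\ \text{km/h}} = 0.038\ \text{h}$$

Under this reading, the shortest route would need only 0.038 h if it were free of congestion, so its reported 0.06 h reflects a delay of roughly 0.022 h on the congested segments, which is why the longer route is the faster one.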

Legend: starting point; ending point; container.


Fig. 4. Chart of the optimal route 





Fig. 5. Chart of the shortest route 


4.2 Fuzzification 


Fuzzification of Distance. Table 1 shows the partition of the container truck travel distance.


Table 1. Container truck travel distance partition


Class | 1        | 2          | 3          | 4          | 5          | 6           | 7            | 8
R     | (0, 100] | (100, 300] | (300, 500] | (500, 700] | (700, 900] | (900, 1200] | (1200, 1500] | >1500

R represents distance. According to the actual situation, the distance is divided
into 8 grades: 1, 2, 3, 4, 5, 6, 7, 8. The fuzzy linguistic values are defined as very near,
near, nearer, general, far, farther, very far, and extremely far, and the corresponding fuzzy
sets are N1, N2, N3, N, R3, R2, R1, VR. Given the distribution of the discrete points, the
triangular membership function is adopted for its convenience.
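
A triangular membership function of the kind mentioned above can be sketched as follows. The function name and the breakpoints a, b, c are hypothetical; the paper does not list the breakpoints it used for N1 ... VR, so the usage values below are purely illustrative.

```cpp
#include <cassert>

// Triangular membership function with feet at a and c and peak at b (a < b < c).
// Returns the degree of membership of x, between 0 and 1.
double triangular(double x, double a, double b, double c) {
    assert(a < b && b < c);
    if (x <= a || x >= c) {
        return 0.0;  // outside the support of the fuzzy set
    }
    if (x <= b) {
        return (x - a) / (b - a);  // rising edge
    }
    return (c - x) / (c - b);  // falling edge
}
```

If, for example, a fuzzy set were centred on grade 2 with feet at grades 1 and 3, a value of 1.5 would belong to it with degree 0.5.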


Fuzzification of Priority of Handling Operation. Let P be the manually defined
priority, N the number of CTs already arranged, and L the priority of the handling operation:


$$L = P - N$$


The fuzzy linguistic values are defined as unimportant, a little important, important,
very important, and extremely important, and the corresponding fuzzy sets are P1, P2, P3,
P4, VP.


Fuzzification of Dispatching Emergency Degree. Let S be the dispatching emergency
degree. The fuzzy linguistic values are defined as very low, lower, low, general,
high, higher, and very high, and the corresponding fuzzy sets are D1, D2, D3, M, M2, M1,
VM.


Determination of Fuzzy Rules. Scheduling principles:


(1) The dispatcher's assignment takes precedence;
(2) Important tasks have priority;
(3) Shorter distances are preferred;
(4) The requirements of QCs take precedence.


The experience of terminal scheduling engineers and operators is summarized so
that the fuzzy rules reflect the actual scheduling principles of the terminal as closely
as possible. Table 2 shows the rule table for fuzzy control.


Table 2. Rule table for fuzzy control 


P1
P2 
P3 
P4 
VP 





Fuzzy Inference and Defuzzification. The centroid method is used in this system. The
crisp outputs corresponding to each combination of input values are calculated with MATLAB.
Table 3 shows the decision query table.

In practical work, the terminal management system can query the table directly for
the dispatching emergency degree.
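
For reference, centroid defuzzification over a sampled output set can be sketched in C++ as below. This is a generic discrete form of the centroid method, not the MATLAB computation used by the authors; the function name and the sampling of the output universe are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Discrete centroid: sum(x * mu(x)) / sum(mu(x)) over the sampled points,
// where mu[i] is the aggregated membership degree of output value x[i].
double centroid(const std::vector<double>& x, const std::vector<double>& mu) {
    assert(x.size() == mu.size());
    double weighted = 0.0;
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        weighted += x[i] * mu[i];
        total += mu[i];
    }
    assert(total > 0.0);  // at least one rule must have fired
    return weighted / total;
}
```

Precomputing this value for every pair of input grades is what yields a query table such as Table 3, so that no fuzzy inference has to run at dispatching time.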


Table 3. Decision query table


i R B ~h b j6 7 B 
3.04 |27 
3.27 |34 
3.43 [3.35 
3.56 [3.44 
3.78 | 3.52 
3.81 | 3.59 
3.91 |369 


Figure 6 shows the relationship between R, L and S in MATLAB. 







Fig. 6. System output result schematic 


The two horizontal axes represent the inputs of the system (L, R), and the vertical axis
represents the output (S). It can be seen from this figure that S increases as L and R
decrease. The surface is smooth, which indicates that the design of the membership
functions and fuzzy rules is basically sound.


5 Case Analysis 


Four operations are requested, as shown in Table 4, and three CTs are idle, as shown in
Table 5. After the CT dispatching Agent announces the four tasks, the three idle CT Agents
obtain the task dispatching emergency degrees from the fuzzy reasoning controller, as shown
in Tables 6, 7 and 8.

Table 4. The information of cooperative request for transporting


Task serial number | Request equipment number | Task priority | Loading container number
1                  | QC105                    | 5             | CHPU2306280
2                  | QC111                    | 4             | SNBU476253
3                  | YC203                    | 4             | U88300
4                  | YC208                    | 3             | AM2U4166740


Table 5. The information of idle CTs 


CT number Operating state 
CT301 idle 
CT311 idle 
CT312 idle 





Table 6. Task dispatching emergency degree evaluated by agent of CT301


CT number | Request equipment number | Distance | Task dispatching emergency degree
CT301     | QC105                    | 300      | 4.34
CT301     | -                        | -        | -
CT301     | -                        | 950      | 3.72
CT301     | -                        | -        | 3.43


Table 7. Task dispatching emergency degree evaluated by agent of CT311


CT number | Request equipment number | Distance | Task dispatching emergency degree
CT311     | QC105                    | 850      | 4.08
CT311     | QC111                    | 850      | 3.87
CT311     | YC203                    | 9850     | 3.72
CT311     | YC208                    | 650      | 8.65


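Table 8. Task dispatching emergency degree evaluated by agent of CT312


CT number | Request equipment number | Distance | Task dispatching emergency degree
CT312     | QC105                    | -        | 4.08
CT312     | -                        | -        | 3.9
CT312     | -                        | -        | 3.87
CT312     | -                        | -        | 3.72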


All three CT Agents bid for QC105, whose emergency degree is the highest. The CT
dispatching Agent evaluates these bidders, as shown in Table 9. Since the evaluated
dispatching emergency degree of CT301 is the highest, the CT dispatching Agent awards
the task to the Agent of CT301 and rejects the applications of CT311 and CT312.


Table 9. Task dispatching emergency degree sort of QC105


Request equipment number | CT number | Task important degree | Distance | Task dispatching emergency degree
QC105                    | CT301     | -                     | 300      | 4.34
QC105                    | CT312     | -                     | -        | 4.08
QC105                    | CT311     | -                     | 850      | 4.03


CT311 and CT312 then each select, from the remaining tasks, the one with the highest
assessed dispatching emergency degree, namely QC111 and YC208, and bid for it. After
evaluating the bids, the CT dispatching Agent issues orders to CT311 and CT312, respectively.
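
The awarding step of the contract net can be sketched in C++ as follows, using the three bids for QC105 listed in Table 9. The Bid structure, the function name, and the tie-breaking rule (the first bid wins a tie) are assumptions introduced for illustration; the paper does not describe how ties are resolved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bid submitted by a CT Agent for one announced task.
struct Bid {
    std::string ctNumber;
    double emergencyDegree;  // task dispatching emergency degree
};

// Returns the index of the winning bid: the highest emergency degree,
// with the earliest bid winning a tie (an assumption, see the lead-in).
std::size_t awardContract(const std::vector<Bid>& bids) {
    assert(!bids.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < bids.size(); ++i) {
        if (bids[i].emergencyDegree > bids[best].emergencyDegree) {
            best = i;
        }
    }
    return best;
}
```

Calling awardContract on the bids {CT301, 4.34}, {CT312, 4.08}, {CT311, 4.03} selects CT301, after which the rejected agents bid again for the remaining tasks.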


6 Conclusion 


A model of a container terminal scheduling system was established based on Multi-Agent
technology, and the dispatching of Container Trucks (CTs) under a dynamic dispatching
strategy was studied. The relationship between transport tasks and the service of CTs was
treated as a contract net, fuzzy theory was used for the optimum allocation of multiple
tasks to multiple CTs, and a bidirectional negotiation mechanism was adopted. Multiple
factors of each task were coupled to obtain the assessment score of its dispatching
emergency. By judging the obstruction status from the related data of the GPRS system, the
steps of an ant colony algorithm were designed to calculate the distance of the optimal
route. Before the method can be applied in practice, further research is needed to take
more practical factors into account and to optimize the algorithm.


Abstract. Since the concepts of mobile Internet, big data, and cloud computing
were proposed, the format of information has changed tremendously: information
now arrives in large quantities, is complex, and has no fixed data form. This poses
a considerable challenge to traditional fixed-format protocol analysis. Dynamic
processing and analysis of data is very important for flexible data processing, easy
scalability, and better fault tolerance. The dynamics are mainly reflected in dynamic
definition, dynamic analysis, and dynamic processing. In dynamic analysis, the
specific format of the data does not need to be set in the program; only the
framework of dynamic analysis needs to be built into the sending and receiving
programs, which can then read the data format automatically and obtain the data
content easily. This approach makes the program better suited to current
applications in size and flexibility than traditional fixed communication protocol
analysis. Redundant structures need not be added to the programs, which no longer
process data passively. Data transmission becomes more convenient, because the
administrator need not take care of the data format when transferring data.


Keywords: SQLite · Communication protocol · Dynamic analysis · Message buffer


1 Introduction


At present, the format of messages transmitted between application processes on different
terminal systems is mostly defined in the programs and cannot be changed when users
transmit messages with those processes; the flexibility to define message formats
dynamically is lacking [1]. Once a user modifies the message format, program developers
have to change the format in the code, and part of the communication procedure of the
application has to be retested, which consumes a lot of manpower and material resources [5].
Meanwhile, the format of information has changed tremendously: information now arrives in
large quantities, is complex, and has no fixed data form. This poses a considerable
challenge to traditional fixed-format protocol analysis. Dynamic processing and analysis
of data is very important for flexible data processing, easy scalability, and better fault
tolerance.

The analysis framework for application layer communication protocols based on SQLite [6]
proposed in this paper presents the message format, formerly defined in the program, in
the form of a database table; users can configure the table flexibly as needed, and it is
analyzed dynamically when in use. The whole framework contains the protocol development
process, dynamic object generation process, protocol analysis process, message combination
sending process, message receiving process, data storage process, and communication
process. The dynamics are mainly reflected in the dynamic definition of the communication
protocol, dynamic analysis, and dynamic processing. In dynamic analysis, the specific
format of the data does not need to be set in the program; only the framework of dynamic
analysis needs to be built into the sending and receiving programs, which can then read the
data format automatically and obtain the data content easily. This approach makes the
program better suited to current applications in size and flexibility than traditional
fixed communication protocol analysis. Redundant structures need not be added to the
programs, which no longer process data passively. Data transmission becomes more
convenient, because the administrator need not take care of the data format when
transferring data.

Because the SQLite memory database is a lightweight database with advantages such as low
resource consumption, portability, security, reliability, and zero management cost, this
paper uses it as the base carrier for message format storage and high-speed information
buffering.


2 SQLite Database Technology 


The SQLite memory database is an embedded database engine, particularly suitable for data
access on a variety of devices with limited resources (such as mobile phones, pads, and
other intelligent devices) [3, 7]. It is a relational database management system that
complies with ACID; it was designed for embedded use and has been applied in many embedded
products, and its source code is available on the official website. SQLite takes up very
few resources: in an embedded device it occupies memory on the order of 100 KB [8]. It is
supported by mainstream desktop operating systems such as Windows, Linux, and Unix and by
the mobile operating system platforms, and it can be combined with many programming
languages and interfaces, such as C#, C/C++, PHP, Java, and ODBC. It is also reported to be
faster than MySQL and PostgreSQL, two world-famous open source database management
systems [4]. SQLite supports most of the SQL92 syntax and allows developers to manipulate
data in the database with SQL statements. A SQLite database is just a file; unlike
databases such as Oracle and MySQL, it does not need to be installed or to have server
processes started. It has the following characteristics:


(1) Lightweight: SQLite is a built-in database; all database operations can be
completed with a single dynamic library.

(2) Portability: SQLite can run in a variety of environments. As the source code
provided by the official website shows, it covers the mainstream platforms on both
the desktop side and the mobile side.

(3) Security and Reliability: When the same data need to be accessed by multiple
processes simultaneously, these processes can read the data in the database at the
same time but cannot write them at the same time, which ensures the reliability of
the data.

(4) Green software: The core engine of SQLite does not rely on any third-party
software and requires no installation, which saves a lot of trouble at deployment
time.

(5) Easy to manage: The various data in SQLite, including graphics, tables, and
other files, are isolated from each other, which prevents mutual interference and
also facilitates operations on the database.


3 Overall Structure of the Application Layer Communication 
Protocol Analysis Framework 


3.1 Architecture Design of Analysis Framework 


Figure 1 shows the architecture design of the application layer communication protocol
analysis framework. In this framework, the distinction between sender and receiver is
blurred, so we do not deliberately distinguish between them.

Fig. 1. Overall structure of the application layer communication protocol analysis framework


First, the framework provides the user with a data table that can be created and
modified; this table stipulates the format in which the protocol is created. Once
created, the protocol can be analyzed by the software, that is, it enters the
protocol analysis process. During the protocol analysis process, an object for each
attribute in the protocol is generated by starting the dynamic object generation process.
After the protocol has been analyzed, the message combination sending process is started
if the simulation software needs to send data.

During the message sending process, the UDP or DDS [2] interface of the communication
process is not called directly for data transmission. Instead, the data storage process is
started first and stores the data to be sent in the memory database; a specific message is
then thrown to the communication process, which, on receiving this message, initiates the
send function, retrieves the data to be sent from the memory database, and sends them to
the destination. When the communication process of the simulation software receives a
message sent by external software, the message is not analyzed immediately: after the UDP
or DDS communication interface receives the information, the raw string is first stored in
the memory database, and a specific message is then thrown to the protocol analysis
process, which, on receiving it, retrieves the raw string from the memory database and
initiates the protocol analysis function to complete the analysis of the message. During
analysis, the message is associated with a specific protocol according to its information
identifier, and the analysis is completed according to that protocol. The analyzed data
are stored in the memory database again, and a specific message is then sent to inform the
business application layer to complete the corresponding business functions, such as
display and calculation.
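
The store-then-notify pattern described above can be illustrated with a small C++ sketch. It deliberately replaces the SQLite memory database with a std::map and the thrown message with a queue of record keys, so that it stays self-contained; the class and method names are hypothetical, and deleting a record once it has been retrieved is an assumption rather than something the paper specifies.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>

// Stand-in for the memory database plus the notification channel.
class MessageBuffer {
public:
    // Store the raw data first, then post a notification that carries its key.
    int storeAndNotify(const std::string& raw) {
        const int id = nextId_++;
        store_[id] = raw;
        pending_.push(id);
        return id;
    }

    bool hasNotification() const { return !pending_.empty(); }

    // Consume one notification and fetch (and remove) the data it refers to.
    std::string retrieveNext() {
        assert(!pending_.empty());
        const int id = pending_.front();
        pending_.pop();
        std::string raw = store_[id];
        store_.erase(id);
        return raw;
    }

private:
    int nextId_ = 1;
    std::map<int, std::string> store_;  // plays the role of the memory database
    std::queue<int> pending_;           // plays the role of the thrown messages
};
```

The point of the pattern is that the producer (a sender or the network interface) and the consumer (the communication or analysis process) exchange only a small notification, while the payload itself always passes through the shared store.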


3.2 Communication Process Based on Framework 


The basic communication process of software built on the framework is shown in Fig. 2.
As can be seen from the figure, both data sending and data receiving are centered on the
SQLite memory database. Taking the memory database as the center is another important
feature of the framework proposed in this paper.


4 Analysis Method of Application Layer Communication Protocol 


4.1 Protocol Configuration Process 


In the analysis method for application layer communication protocols, information is
configured according to the information format, and received information is identified
in order to analyze it and store it in formatted form; the analysis method can also call
business applications in real time through the service management center. Users can
configure the communication protocol dynamically through configuration files, and can
likewise configure the scheduling relationship between real-time information and business
applications dynamically through the configuration file.

The purpose of configuring the information protocol is to let the software developer
configure it dynamically. Once the information protocol is configured in the database, the
framework can analyze received information by matching the identifier of the message
against the configuration. The information protocol is configured by the developers of the
simulation software. The configuration is as follows:


(1) Information index configuration 
Database name: Config.db 
Database path: main_project_directory\Bin\Win32\Config 
Database table: MessageDictionary 


When configuring the information protocol, the basic information in the information
index table should be configured first. In this table, RecNo is the information number
automatically generated by the database; ID is the protocol number (must be unique);
Name is the information protocol name; and ReMark is the information protocol description.


(2) Information protocol configuration 
Database name: StructMessage.db 
Database path: main_project_directory\Bin\Win32\Config 
Database table: MessageDictionary 


Table 1. An example of the protocol configuration


ID | startIndex | ByteCount | Type
1  | 0          | 16        | Char[]
2  | -          | -         | Char[]
3  | -          | -         | FLOAT
4  | -          | -         | FLOAT
5  | -          | -         | DOUBLE

For each kind of information, a protocol configuration table is established and named
after the information protocol (the value of the Name item) in the MessageDictionary
table.

As shown in Table 1, RecNo in the configuration table is the protocol field number
automatically generated by the database; ID is the field number in the protocol (must be
unique); EnName is the English name of the field; startIndex is the number of the first
byte of the field in the entire protocol, with the bytes of the protocol numbered from "0";
ByteCount is the number of bytes occupied by the field; Type is the field type; Rule is an
additional processing instruction for the field; and ReMark is the field description.
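
To make the role of startIndex, ByteCount, and Type concrete, the following C++ sketch extracts one field from a received byte buffer using a row of such a configuration table. It is an illustration only: the FieldSpec structure, the function names, the field name "speed", and the offsets in the usage example are hypothetical, the row is hard-coded instead of being read from SQLite, and the float is assumed to be 4 bytes in the byte order of the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One row of a protocol configuration table (a subset of its columns).
struct FieldSpec {
    int id;
    std::string enName;
    std::size_t startIndex;  // number of the first byte, counted from 0
    std::size_t byteCount;   // bytes occupied by the field
    std::string type;        // e.g. "Char[]", "FLOAT", "DOUBLE"
};

// Copies the bytes of one field out of a received message.
std::vector<unsigned char> extractField(const std::vector<unsigned char>& msg,
                                        const FieldSpec& f) {
    assert(f.startIndex + f.byteCount <= msg.size());
    const auto first = msg.begin() + static_cast<std::ptrdiff_t>(f.startIndex);
    return std::vector<unsigned char>(
        first, first + static_cast<std::ptrdiff_t>(f.byteCount));
}

// Interprets the bytes of a FLOAT field as a float in host byte order.
float asFloat(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() == sizeof(float));
    float value = 0.0f;
    std::memcpy(&value, bytes.data(), sizeof(float));
    return value;
}
```

Because the offsets and types come from data rather than from a compiled structure definition, changing the message layout only requires editing the table rows, which is the flexibility the framework aims for.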


4.2 Protocol Analysis Implementation 


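(1) Dynamic generation of the object:
ClassCreator::ClassCreator(const char* className, CreateClass cc)
{
    // Function-local static map: created on first use and shared by all creators.
    static std::map<std::string, CreateClass> s_classMap;
    pMap = &s_classMap;
    // Register the creation function only if the class name is not yet known.
    std::map<std::string, CreateClass>::iterator it = pMap->find(className);
    if (it == pMap->end())
    {
        (*pMap)[className] = cc;
    }
}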


This code registers the creation function used for the dynamic generation of objects. The
objects must be generated before dynamic analysis so that the analysis process can be called.


(2) Process of dynamic analysis:
bool CStructMessage::ParseMessage(unsigned char* pBuf)
{
    // Let each field object read its own value from the raw buffer.
    for (size_t i = 0; i < m_StructMessageVec.size(); i++)
    {
        m_StructMessageVec[i]->SetValue(pBuf);
    }
    return true;
}


By looping over the field objects built from the valid data in the database, this code
parses each segment of the data into a usable value; these values are then combined into a
valid string for transmission in the next step.
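
The registration idea behind listing (1) can be shown as a self-contained C++ sketch. It is not the authors' ClassCreator: the ClassRegistry and FloatField names are hypothetical, and CreateClass is assumed to be a pointer to a function that takes no arguments and returns void*, since the paper does not show its definition.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed shape of the creation function type (not shown in the paper).
typedef void* (*CreateClass)();

// Registry that maps class names to creation functions.
class ClassRegistry {
public:
    // Registers a creation function unless the name is already present.
    static void registerClass(const std::string& name, CreateClass cc) {
        std::map<std::string, CreateClass>& m = classMap();
        if (m.find(name) == m.end()) {
            m[name] = cc;
        }
    }

    // Creates an object by class name; returns nullptr for unknown names.
    static void* create(const std::string& name) {
        std::map<std::string, CreateClass>& m = classMap();
        std::map<std::string, CreateClass>::iterator it = m.find(name);
        return it == m.end() ? nullptr : it->second();
    }

private:
    // Function-local static, so the map exists before its first use.
    static std::map<std::string, CreateClass>& classMap() {
        static std::map<std::string, CreateClass> s_classMap;
        return s_classMap;
    }
};

// Hypothetical field class that could be named in the Type column.
struct FloatField {
    float value = 0.0f;
};

void* createFloatField() { return new FloatField(); }
```

With such a registry, the analysis process can turn a type name read from the configuration table into an object at run time, for example ClassRegistry::create("FloatField"), without a hard-coded switch over field types. The caller owns the returned object and is responsible for deleting it.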


5 Message Buffer Design 


To implement the message buffer, the whole system is divided into three functional
modules. The first is the Socket communication module, which is mainly used to accept and
transmit messages that need to be buffered, such as strings and structures. The second is
the SQLite message access module, which stores and retrieves the buffered messages in the
SQLite database. The third is the forwarding module based on the Windows message
mechanism, which generates a message reminder in the process and then distributes it. The
overall structure is shown in Fig. 3.

Fig. 3. Structure of the message buffer


Implementing the message buffer requires communication between programs, so Socket
network communication is used. Messages are transmitted through sockets between program A
and the message buffer. Socket programming offers three socket types: streaming sockets
(SOCK_STREAM), datagram sockets (SOCK_DGRAM), and raw sockets (SOCK_RAW); this program is
designed based on UDP socket programming and uses streaming sockets. The programming steps
of the sender in program A are as follows: (1) load the socket library and create the
socket (WSAStartup()/socket()); (2) send the connect request to the receiver of the buffer
(connect()); (3) communicate with the receiver of the message buffer (send()/recv());
(4) close the socket and unload the socket library (closesocket()/WSACleanup()).

When the message buffer receives a message through the socket, it stores the message in
the SQLite database according to its type; because the stored messages have different
types, the SQLite storage operations also differ. Step 1: load/release the Winsock
library; Step 2: connect to SQLite; Step 3: perform the storage operation that
corresponds to the message type.

Once a message has been stored in the SQLite database, the system sends a message
reminder to different windows depending on the message type. This involves the Windows
message mechanism and dialog window design with MFC, which will be introduced in the next
two sections. After a window receives a message reminder, the MFC window processes the
message; owing to the limitations of the design, however, message processing in this
experiment is only simulated: messages are selected from the SQLite database and displayed
in MFC Edit controls.


6 Conclusions 


The analysis framework for application layer communication protocols based on SQLite
proposed in this paper presents the message format, formerly defined in the program, in
the form of a database table; users can configure the table flexibly as needed, and it is
analyzed dynamically when in use.

The proposed framework makes the program better suited to current applications in size
and flexibility than traditional fixed communication protocol analysis. With this
framework, redundant structures need not be added to the programs, which no longer process
data passively. Data transmission becomes more convenient, because the administrator need
not take care of the data format when transferring data.

However, this paper uses only one type of database, SQLite, and implements only one
communication mode, UDP, so the application scope of the framework is still limited. In
future work, supporting more types of database and more communication modes will be an
important direction for developing this framework.


Abstract. The BioDynaMo project was created in CERN OpenLab V and aims
to become a general platform for computer simulations for biological research.
An important development stage was the code modernization, in which the
architecture was changed to utilize the multiple levels of parallelism offered by
today's hardware. Individual neurons are implemented as spherical (for the soma)
and cylindrical (for neurites) elements that have appropriate mechanical properties.
The extracellular space is discretized, and allows for the diffusion of extracellular
signaling molecules, as well as the physical interactions of the many developing
neurons. This paper describes methods for the real-time simulation of growing brain
dynamics in the BioDynaMo project, especially the implementation of growth
guidance factor diffusion via octree spatial structures.


Keywords: Biodynamo · Growth guidance factor · Diffusion · Octree ·
Octree spatial structures · Neuronal systems simulation


1 Introduction 


The BioDynaMo project [2] aims at a general platform for computer simulations for
biological research. It is a collaboration between CERN, Newcastle University, Innopolis
University, and Kazan State University, in which each side takes part according to its
professional skills. The main idea of the project is to close the gap between very
specialized applications and highly scalable systems, giving life scientists access to
rapidly growing computational resources [1]. Since scientific investigations require
extensive computer resources, the platform should be executable on hybrid cloud computing
systems, allowing for the efficient use of state-of-the-art computing technology.

The first development stage was code modernization based on the neurodevelopmental
simulation software Cortex3D [15]: the application was ported from Java to C++, and its
architecture was changed to utilize the multiple levels of parallelism offered by
today's hardware.

In the current state of the project, distributed teams are working on the first
iteration: developing the spatial organization layer [7] using octree encoding, together
with the simulation of neuronal growth and extracellular diffusion and the visualisation
of these processes.

These tools are designed for modeling the development of large, realistic neural networks,
such as the neocortex, in physical 3D space. Neurons arise by the replication and
migration of precursors, which mature into cells that can extend axons and dendrites.
Individual neurons are represented by spherical (for the soma) and cylindrical (for
neurites) elements that have appropriate mechanical properties. The growth functions
of each neuron are encapsulated in a set of predefined modules that are automatically
distributed across its segments during growth. The extracellular space is also discretized,
which allows for the diffusion of extracellular signaling molecules as well as the physical
interactions of all developing neurons.

Typical approaches for space discretization include logically structured grids, block 
structured and overlapping grids, unstructured grids, and octrees. Structured (regular) 
grids are relatively easy to implement, have low memory requirements but limited 
adaptivity, whereas unstructured grids can conform to complex geometries and enable 
non-uniform discretization. 

Unstructured grids, however, carry the overhead of explicitly storing element-node
connectivity information and are in general cache inefficient because of random memory
access. Regular grids are easy to generate but can be quite expensive when the solution
of the PDE problem is highly localized; such localized solutions can be computed more
efficiently using unstructured grids. Octrees [12] offer a good balance between
adaptivity and performance. They are more flexible than regular grids, the overhead of
constructing element-to-node connectivity information is lower than in unstructured
grids, and they allow calculations without the use of matrices.

In this work we focus on reducing the time needed to build octree data structures, their
memory overhead, and the time needed to solve the diffusion equation numerically using
these data structures.


2 Octree Data Structure 


An octree is a tree data structure used for spatial decomposition in which every
node (octant) has at most eight children (see Fig. 1). Octrees are analogous to binary
trees (at most 2 children per node) in 1D and quadtrees (at most 4 children per
node) in 2D. The only octant with no parent is the root, and an octant with no children
is called a leaf. An octant with one or more children is called an interior octant, and
octants with the same parent are called siblings. Complete octrees are octrees in which
every interior octant has exactly eight children. The depth of an octant from the root is
referred to as its level. Octrees can be used effectively to partition cuboid regions, and
these regions are referred to as the domain of the tree. The geometric modeling technique
called Octree Encoding was first presented in [10] in 1981.

Fig. 1. Left: One large cell neighboring 8 smaller cells. The c_i represent the x components of the
intermediate concentration c defined at the cell faces. Right: Zoom of one computational cell. The
concentration components are defined on the cell faces, while the pressure is defined at the center
of the cell.
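
The octant terminology of this section (root, leaf, interior octant, level) can be illustrated with a minimal Python sketch. It is not the BioDynaMo implementation, which is written in C++; the class name `Octant`, its fields and the `refine_at` helper are hypothetical names chosen for illustration, and the query point is assumed to lie inside the root cube.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Octant:
    """One node of the octree: a cube given by its centre and half edge length."""
    center: Point
    half: float
    level: int = 0                      # depth of the octant from the root
    children: List["Octant"] = field(default_factory=list)
    concentration: float = 0.0          # hypothetical payload for a diffusion model

    def is_leaf(self) -> bool:
        return not self.children

    def contains(self, p: Point) -> bool:
        # Half-open box, so a point on a shared face belongs to one octant only.
        return all(c - self.half <= x < c + self.half for x, c in zip(p, self.center))

    def subdivide(self) -> None:
        """Turn a leaf into an interior octant with exactly eight children."""
        if self.children:
            return
        q = self.half / 2.0
        cx, cy, cz = self.center
        for dx in (-q, q):
            for dy in (-q, q):
                for dz in (-q, q):
                    self.children.append(
                        Octant((cx + dx, cy + dy, cz + dz), q, self.level + 1)
                    )

    def refine_at(self, p: Point, max_level: int) -> "Octant":
        """Subdivide along the path to p (assumed inside the root) and return its leaf."""
        node = self
        while node.level < max_level:
            node.subdivide()
            node = next(ch for ch in node.children if ch.contains(p))
        return node

    def leaves(self) -> List["Octant"]:
        if self.is_leaf():
            return [self]
        return [leaf for ch in self.children for leaf in ch.leaves()]


# Refine the domain [-1, 1)^3 around one point, e.g. the position of a source.
root = Octant((0.0, 0.0, 0.0), 1.0)
leaf = root.refine_at((0.3, 0.3, 0.3), max_level=3)
```

Refining only along the path to a point of interest keeps the cell count low away from that point, which is the adaptivity the section attributes to octrees.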


3 Diffusion Simulation Using an Octree 


Diffusible extracellular signaling molecules play an important role in developing neuronal
systems. They are synthesized by different sources in neuronal tissue and reach
their targets through diffusion. These signals can be further combined and modulated
by processing of the molecule both directly at the cell surface and by the intracellular
signaling mechanisms that are activated [4]. Simulating the diffusion of growth guidance
factors in the 3D model space is a difficult and computationally expensive problem
[9, 14]. To keep diffusion computationally tractable, the diffusion space must be
quantized at a resolution that matches the precision required by the cellular detection
mechanisms, the cellular density, etc. Various physical and biological processes are
modeled using parabolic equations. The finite element method is a popular technique
for solving parabolic partial differential equations (PDEs) numerically. Finite element
methods require grid generation to construct function approximation spaces.

The concentration $c_i = c_i(\mathbf{r}_j - \mathbf{r}_i, t)$ of growth guidance factor at the
point $\mathbf{r}_j$ at the moment $t$ depends on the concentration of growth guidance factor
released by the $i$-th cell located at the point $\mathbf{r}_i$. The concentration obeys the
standard diffusion equation

$$\frac{\partial c_i}{\partial t} - D^2 \Delta_r c_i + k c_i = J_i(\mathbf{r}_i, t). \qquad (1)$$

Here $D^2$ is the diffusion coefficient, $k$ is the degradation coefficient, $\Delta_r$ is the
Laplace operator in 3D space, and the quantity $J_i(\mathbf{r}_i, t)$ is the source [6].
Figure 1 illustrates an unrestricted octree data structure (see e.g. [12]) with a standard
grid arrangement [8]. This is convenient since interpolations are more difficult with
cell-centered data (see e.g. [13]). Coarsening is performed from the smaller cells to the
larger cells, i.e. from the leaves to the root. The procedure is shown as pseudo-code in
Algorithm 1.


Data: Octree data structure.
Result: Diffusion process.

// Produce substances step; the production coefficient J is specific to each substance.
for all Source do
    Concentration ← Concentration + J_source
    if Concentration > 1 then
        Concentration ← 1
    end
end
// Diffusion step; D² is the diffusion coefficient.
for all Octree.Leaf do
    Concentration ← Concentration + 2D² × (Neighbor − Current)
end
// Decay step; k is the degradation coefficient.
for all Octree.Leaf do
    Concentration ← Concentration × (1 − k)
end

Algorithm 1: Pseudo-code for the produce substances, diffusion and decay
steps of the simulation using an octree.
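
The three steps of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's code: leaves are identified by integer ids, the neighbour lists are given explicitly instead of being derived from an octree, the exchange is summed over all neighbours, and all new values are computed from the old ones. The parameter `rate` stands for the factor 2D² of the pseudo-code.

```python
from typing import Dict, List


def produce(conc: Dict[int, float], sources: Dict[int, float]) -> None:
    """Produce step: each source cell gains its coefficient J, capped at 1."""
    for cell, j in sources.items():
        conc[cell] = min(conc[cell] + j, 1.0)


def diffuse(conc: Dict[int, float], neighbors: Dict[int, List[int]], rate: float) -> None:
    """Diffusion step: exchange with each neighbour in proportion to the difference.

    All new values are computed from a copy of the old ones (synchronous
    update); this is an assumption, since Algorithm 1 does not specify it.
    """
    old = dict(conc)
    for cell, nbrs in neighbors.items():
        conc[cell] = old[cell] + rate * sum(old[n] - old[cell] for n in nbrs)


def decay(conc: Dict[int, float], k: float) -> None:
    """Decay step: every leaf loses the fraction k of its concentration."""
    for cell in conc:
        conc[cell] *= 1.0 - k


def step(conc, neighbors, sources, rate, k) -> None:
    """One simulation iteration in the order used by Algorithm 1."""
    produce(conc, sources)
    diffuse(conc, neighbors, rate)
    decay(conc, k)


# Five leaf cells in a row; cell 2 is the only source (hypothetical values).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
conc = {cell: 0.0 for cell in neighbors}
for _ in range(10):
    step(conc, neighbors, sources={2: 0.5}, rate=0.1, k=0.01)
```

With symmetric neighbour lists the diffusion step only moves substance between cells, so the total amount changes solely through the produce and decay steps, which mirrors the source and degradation terms of Eq. (1).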


[Figure 2: plot of time in seconds (vertical axis) against iteration number (horizontal axis, 0 to 100).]

Fig. 2. The dependence of diffusion step solving time on number of octree nodes. Red line:
regular grid; blue line: octree structure.


4 Results


We simulated the diffusion process with 100,000 nodes, once on a regular grid and once
using octree spatial structures. Figure 2 shows how the time needed to solve the diffusion
problem depends on the number of octree nodes. Comparing the results of the two
modes, one can observe the efficiency of the octree for large numbers of nodes: using
octree spatial structures yields a significant reduction in computation time.

Figure 3 shows the results of the simulation as a contour plot of substance concentration
for a system containing two sources. The 2D snapshots are shown for 4 moments in time
and illustrate that the diffusion process is reproduced realistically.





Fig. 3. Visualisation of the diffusion process of two sources for 4 different iterations: a) 15, b) 26,
c) 41, d) 70.


5 Conclusion 


In computational physics, there has been plenty of work using either multilevel grids or
adaptive mesh refinement to improve computational efficiency.

Quadtree- or octree-based adaptive refinement has also been proposed in [3, 5, 11]. The
octree-based algorithm for diffusion process simulation developed in this work is used
for modeling growth and development in neural networks [15]. The method adaptively
subdivides the whole simulation volume into multiple subregions using an octree. Each
leaf node in the octree also holds a uniform subgrid, which is the basic unit for simulation.
A novel node subdivision and merging scheme is developed to dynamically adjust the
octree during each iteration of the simulation.

Further development of the project involves adapting the octree algorithms to run in the
cloud using parallel computations. The next stage in the development of substance
diffusion will be devoted to implementing the existing algorithms with parallel
computing methods.


Abstract. Because applications and traffic are diverse, their Quality of Service (QoS)
requirements vary. QoS-aware routing approaches have been researched in order to
improve resource utilization, network performance and user experience. With the
emergence of Software Defined Networking (SDN) and the development of machine
learning, this paper proposes a mechanism based on flow classification and routing
strategy selection. The architecture runs as modules on the SDN controller. On the
Mininet platform, we conducted experiments with the POX controller to verify the
effectiveness of our design and test its performance. The measurement results show
that the throughput and delay of the network differ depending on the routing strategy
used.


1 Introduction


High bandwidth and low delay requirements are major challenges given the rapid emergence
of new techniques such as mobile communication and 5G. Since network traffic is
increasing due to a wider spectrum of new applications, it is important to address the
essential trade-off between the appropriate allocation of restricted network resources
and a relatively efficient routing computation method. The QoS demands of different
applications are diverse, and various flows exist within the same application. In addition,
even among similar types of applications there are flows with different QoS demands.
For instance, real-time multimedia applications usually place high QoS demands on
bandwidth and delay but are tolerant of packet loss, while high definition video streaming
and online competitive gaming need a high data rate and a strict latency guarantee,
respectively. Furthermore, in multimedia delivery networks there are diverse specific
types of flows, such as video, audio, imaging, etc.

A custom path decision should be made for each flow according to its QoS requirements
and the network conditions, which is conducive to a suitable assignment of limited
network resources. On the one hand, if every flow greedily selects the shortest path,
local congestion arises and subsequent packets may be dropped. On the other hand,
more packets can be accepted by balancing network traffic, although this may consume
excessive resources and cause long propagation delay. Therefore, the routing strategy
needs to be customized for particular flows according to both the perception of QoS
requirements and the detection of network conditions, which is conducive to an optimal
QoS guarantee and network resource orchestration [1, 2].

Moreover, no single algorithm outperforms all the others in every case, while there are
always some scenarios in which a given algorithm performs well. However, QoS awareness
makes the routing algorithm even more complex, since various flows must be classified
and each matched to a dynamic, adaptive strategy. Hence, the routing strategy should be
a trade-off between optimality and simplicity, without too much processing overhead or
memory consumption due to high complexity. Compared to static and inflexible strategies
that aim for better average performance, we present a novel mechanism that improves
network performance and reduces computing complexity to some extent.

Considering the traffic distribution state, the QoS requirements of flows and the
dynamically available bandwidth, the proposed strategy increases the utilization of the
overall resource at the cost of increasing the resource consumed by individual flows.
The evaluation results show that, in finding the minimum cost path, it alleviates
performance degradation and simultaneously improves the utilization of the whole
network resource.

The SDN architecture consists of three layers: the forwarding layer, the control layer and
the application layer, as shown in Fig. 1. The physical switch is the crucial element of the
forwarding layer; it performs all data forwarding in accordance with the forwarding rules
in the flow tables provided by the controller. The centralized control plane is the core
layer of the network and is responsible for topology discovery, traffic classification,
strategy formulation and network configuration. Therefore, network traffic is allocated
to paths operated directly by OpenFlow switches, where the flow-handling rules are
installed by the central SDN controller [3].


[Figure 1: diagram of the Application Layer, Control Layer and Infrastructure Layer. Figure 2: diagram of an OpenFlow switch with its flow table, secure channel (SSL) and OpenFlow protocol link to the controller.]

Fig. 1. SDN architecture

Fig. 2. The construction of OpenFlow switch





The OpenFlow switch is composed of three components: the flow table, the secure
channel and the OpenFlow protocol, as illustrated in Fig. 2. The flow table contains the
flow-oriented forwarding rules issued by the relevant controller. The secure channel
carries the interaction between the switch and the controller. The OpenFlow protocol
regulates the messages in the secure channel to realize functions such as traffic statistics
collection, access control and network configuration.

As the SDN architecture provides global visibility of network resources and inherently
programmable interfaces, some explorations of the fusion of SDN with other technologies
have been carried out. Although some concepts were presented several years ago, many
issues are still worth further research, such as efficient self-organized routing
mechanisms, network resource management, QoS perception and environment
forecasting [4-6].

SDN is a promising concept with the powerful advantage of introducing new dimensions
of flexibility and adaptability into current communication networks. For instance, it
makes QoS-aware routing realizable in an agile and dynamic manner. Although some
routing algorithms have been well researched, they have not been applied in
communication networks due to their high implementation complexity and realization
costs. We propose a QoS-aware resource management routing mechanism, which
promises to provide better network resource management, traffic control, and
application classification [7].

The SDN paradigm radically transforms network architecture: it is programmable,
efficient, fine-grained, dynamic and accurate, and it offers a global view and superior
computational capacity. All of these features encourage the development of networks
that use advanced optimization algorithms.

Computationally intensive machine learning algorithms can integrate several online
routing algorithms for the real-time control of new connection requests. Depending on
the available resources and the status of the traffic, cognitive computing can classify
the flows and select adaptive routing strategies. Since new variables and additional
constraints emerge over time, the routing mechanism is supposed to use machine
learning techniques to select the best sequence of accommodating decisions, while
considering anticipated future scenarios, in order to improve network performance. In
this paper, the proposed dynamic adaptive QoS-aware routing mechanism is constrained
by the available network resources and the distinct flow requirements on the SDN
platform [8, 9].

The rest of this paper is organized as follows. Section 2 presents the implementation of
the QoS-aware routing mechanism and its algorithms in detail. Section 3 describes the
simulations. Section 4 analyzes the experimental results and evaluates the performance.
Finally, Sect. 5 concludes and outlines future research.


2 Implementation and Algorithm 


Our QoS-aware routing selection mechanism consists of five modules: the Traffic
Measurement Module, Flow Classification Module, Topology Detection Module, Routing
Selection Module, and Configuration Updating Module, which work collaboratively on
the SDN controller.

First, the Traffic Measurement module collects flow statistics from the forwarding layer.
Then, the Flow Classification module differentiates QoS requirements, such as the
bandwidth of real-time traffic, by analyzing the flow statistics with a data mining
technique such as the K-means algorithm. The Topology Detection module gathers
up-to-date link state information from the switches, such as delay and bandwidth, using
traffic monitoring tools like sFlow or NetFlow. The Routing Selection module has two
parts. On the one hand, flows with smaller bandwidth demands are matched to the
Bandwidth Reserved Shortest Path (BRSP) routing algorithm. On the other hand, flows
with larger bandwidth demands are matched to the Load Balancing Considered Polling
Path (LBCPP) routing algorithm. Finally, the Configuration Updating module updates the
strategies by repeating the above processes if the degree of variation exceeds the set
threshold.


2.1 Traffic Measurement Module 


Applications that support various kinds of multimedia data have specific QoS requirements
on bandwidth, delay and packet loss. This module is responsible for collecting
information about the traffic type. For instance, applications such as high definition
video streaming and online competitive gaming have a high data rate demand and a strict
latency demand, respectively.


2.2 Flow Classification Module 


The purpose of this module is to find a reasonable and effective flow classification
strategy according to the perception of the flows' QoS requirements and the detection of
link states from the nodes. To trade off the complexity and the practicability of the
mechanism, the module measures the bandwidth requirement of the flows and calculates
a critical value that divides the flows into two types.

In order to classify the flows rationally, a clustering algorithm from machine learning
is employed in this process. The K-means method groups previous traffic flows into
categories, such as prioritized mice flows and lower-priority elephant flows. When
new flows reach the switches, they are matched to the suitable class and forwarded
along a path chosen by the corresponding adaptive strategy.
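
The division of flows into two classes by a critical bandwidth value can be sketched as one-dimensional K-means with K = 2. The sketch below is only an illustration: the function names, the sample demands and the choice of the midpoint between the two centroids as the critical value are assumptions, not details given in the paper.

```python
from typing import List, Tuple


def two_means_threshold(demands: List[float], iterations: int = 100) -> Tuple[float, float, float]:
    """Cluster 1D bandwidth demands into two groups with K-means (K = 2).

    Returns (mice_centroid, elephant_centroid, critical_value), where the
    critical value is taken as the midpoint between the two centroids.
    """
    low, high = min(demands), max(demands)       # initial centroids
    for _ in range(iterations):
        mice = [d for d in demands if abs(d - low) <= abs(d - high)]
        elephants = [d for d in demands if abs(d - low) > abs(d - high)]
        if not elephants:                        # all demands are identical
            break
        new_low = sum(mice) / len(mice)
        new_high = sum(elephants) / len(elephants)
        if (new_low, new_high) == (low, high):   # assignments no longer change
            break
        low, high = new_low, new_high
    return low, high, (low + high) / 2.0


def classify(demand: float, critical_value: float) -> str:
    """Match a new flow to the mice or the elephant class."""
    return "mice" if demand <= critical_value else "elephant"


# Hypothetical bandwidth demands in Mbps observed in the previous period.
history = [0.2, 0.3, 0.5, 0.4, 6.0, 7.5, 8.0]
mice_c, elephant_c, critical = two_means_threshold(history)
```

A new flow is then classified with a single comparison against `critical`, which keeps the per-flow cost on the controller small.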

It is inevitable that the status of flows is changing ongoing. For instance, if the size 
enlarges and the amount increases, the mechanism should reserve more bandwidth to 
guarantee for the mice flows admission. Therefore, the residual network resource 
remained for the elephant flows should be confined to release more resource for the mice 
flows with priority. 


2.3 Topology Detection Module 


An OpenFlow-enabled network can be modeled as a graph G = (S, E), where S is the set
of OpenFlow-enabled switches and E is the set of links between adjacent switches. The
state of a link comprises the link capacity, link delay, and available link bandwidth.

Our mechanism attempts to avoid traffic congestion when an incoming flow is added
to a link. The cost of a link is calculated from the link capacity, the link utilization, and
the bandwidth required by the incoming flow. This principle ensures that a link with a
lower utilization rate has a smaller cost. When the utilization of a link approaches its
capacity, occupying the link becomes expensive.
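
The paper does not state the cost formula. As one possible instance with the described behaviour (small cost at low utilization, rapidly growing cost near capacity), the following Python sketch uses the reciprocal of the spare fraction of the link after the flow is admitted. Both the function name `link_cost` and the formula are assumptions made for illustration.

```python
import math


def link_cost(capacity: float, used: float, demand: float) -> float:
    """Hypothetical congestion-aware cost of placing a flow on a link.

    capacity, used and demand share one unit (e.g. Mbps). The cost is
    1 / (1 - utilization after admitting the flow) and is infinite when
    the flow does not fit.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = (used + demand) / capacity
    if utilization >= 1.0:
        return math.inf
    return 1.0 / (1.0 - utilization)


# A 10 Mbps link: lightly loaded versus almost full, for a 1 Mbps flow.
light = link_cost(10.0, 2.0, 1.0)
heavy = link_cost(10.0, 8.0, 1.0)
```

Any monotonically increasing function of the utilization that diverges near capacity would serve the same purpose in a minimum-cost path search.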


2.4 Routing Selection Module 


The routing algorithms aim to find the most feasible path that meets the specific QoS
requirements, which differ depending on the flow type: bandwidth-sensitive traffic,
delay-sensitive traffic, and other traffic. A feasible path provides sufficient resources
for the demands. However, flows with different requirements are best served by
different routing strategies. Therefore, once the QoS requirements have been perceived,
the proposed algorithms assign the flow an optimal route that benefits both traffic
quality and network performance. As explained in the following, the module consists of
two routing algorithms.

The BRSP routing algorithm is matched to flows with smaller bandwidth demands, which
are delay sensitive, bandwidth constrained, and intolerant of packet loss. A reserved
bandwidth guarantee can provide small jitter and low packet loss for particular flows in
some multimedia applications. We find the path with the smallest delay for
delay-sensitive flows. By consuming the reserved bandwidth, this strategy finds the
optimal route that guarantees admission for flows with higher priority, and it computes
the shortest path for the smallest propagation delay. Considering the demands of the
actual network, we adopt the Bellman-Ford algorithm to realize our BRSP algorithm. If
there is no feasible path, the other routing strategy is employed to find a path that
ensures the delivery of the flow temporarily, and the update module immediately starts
to modify the corresponding information and settings.
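
A minimal Python sketch of this idea: links whose reserved bandwidth cannot carry the demand are discarded, and Bellman-Ford then finds the smallest-delay path over the remaining links. The function name `brsp_path`, the link tuple layout and the example values are hypothetical; returning `None` corresponds to the case in which the other routing strategy would be used as a fallback.

```python
import math
from typing import Dict, List, Optional, Tuple

# One directed link: (from, to, delay, reserved_bandwidth). Values are hypothetical.
Link = Tuple[str, str, float, float]


def brsp_path(links: List[Link], src: str, dst: str, demand: float) -> Optional[List[str]]:
    """Smallest-delay path from src to dst over links whose reserved bandwidth
    can carry `demand`. Delays are assumed non-negative. Returns None if no
    feasible path exists."""
    usable = [(u, v, d) for u, v, d, bw in links if bw >= demand]
    nodes = {src, dst} | {n for u, v, _ in usable for n in (u, v)}
    dist: Dict[str, float] = {n: math.inf for n in nodes}
    prev: Dict[str, Optional[str]] = {n: None for n in nodes}
    dist[src] = 0.0

    # Bellman-Ford: relax every usable link at most |V| - 1 times.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, d in usable:
            if dist[u] + d < dist[v]:
                dist[v], prev[v] = dist[u] + d, u
                changed = True
        if not changed:
            break

    if math.isinf(dist[dst]):
        return None
    path: List[str] = []
    node: Optional[str] = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


links = [
    ("s1", "s2", 1.0, 2.0),
    ("s2", "s3", 1.0, 2.0),
    ("s1", "s3", 5.0, 10.0),
]
route = brsp_path(links, "s1", "s3", demand=1.0)
```

Filtering before the search keeps the shortest-path computation itself unchanged, so any standard implementation could be substituted.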

In contrast, the LBCPP routing algorithm is matched to flows with larger bandwidth
demands, which are bandwidth sensitive, delay constrained, and moderately tolerant of
packet loss. A larger bandwidth is provided, and load balancing is used to improve
network performance. The routing algorithm of our mechanism attempts to avoid traffic
congestion whenever the state of the network or the features of the traffic change. After
the bandwidth has been reserved, the module uses the residual network resources to
find the available paths from source to destination. The optimal paths with enough
bandwidth are then determined based on the delay constraint and packet loss. Since we
consider the hop count to be the only factor affecting propagation delay, we constrain
the hop count to at most 1.2 times that of the shortest path. We use a Top-K path
selection algorithm that takes available bandwidth, delay, and utilization rate into
account, and we choose among the selected paths by the round-robin method. Path load
balancing distributes the workload to reinforce network reliability and optimize network
resource utilization.
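
The hop constraint, the Top-K selection and the round-robin assignment can be sketched in Python as follows. The sketch simplifies the ranking to the bottleneck available bandwidth only, whereas the text also mentions delay and utilization rate; the function names, the value K = 2 and the example topology are assumptions, and source and destination are assumed to differ.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple

Path = List[str]


def simple_paths(adj: Dict[str, List[str]], src: str, dst: str, max_hops: int) -> List[Path]:
    """All loop-free paths from src to dst with at most max_hops links (DFS)."""
    found: List[Path] = []
    stack: List[Path] = [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            found.append(path)
        elif len(path) - 1 < max_hops:
            stack.extend(path + [n] for n in adj.get(path[-1], []) if n not in path)
    return found


def top_k_paths(adj: Dict[str, List[str]], bandwidth: Dict[Tuple[str, str], float],
                src: str, dst: str, demand: float, k: int = 2,
                stretch: float = 1.2) -> List[Path]:
    """Keep the k feasible paths with the largest bottleneck bandwidth."""
    candidates = simple_paths(adj, src, dst, max_hops=len(adj))
    if not candidates:
        return []
    shortest = min(len(p) - 1 for p in candidates)
    limit = int(stretch * shortest + 1e-9)   # hop count <= 1.2 x shortest path

    def bottleneck(p: Path) -> float:
        return min(bandwidth[(a, b)] for a, b in zip(p, p[1:]))

    feasible = [p for p in candidates
                if len(p) - 1 <= limit and bottleneck(p) >= demand]
    return sorted(feasible, key=bottleneck, reverse=True)[:k]


def round_robin(paths: List[Path]) -> Iterator[Path]:
    """Hand out the selected paths in turn to successive flows."""
    return cycle(paths)


adj = {"a": ["b", "c"], "b": ["d", "c"], "c": ["d"], "d": []}
bandwidth = {("a", "b"): 10.0, ("b", "d"): 4.0, ("a", "c"): 8.0,
             ("c", "d"): 8.0, ("b", "c"): 10.0}
selected = top_k_paths(adj, bandwidth, "a", "d", demand=3.0)
assigner = round_robin(selected)
```

In this example the three-hop path a, b, c, d is excluded because the shortest path has two hops and 1.2 × 2 rounds down to two.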


2.5 Configuration Updating Module 


Due to the instability and resource restrictions of the network environment, more
attention should be paid to continuously emerging variation. To accommodate changes
in the distribution of flows and the condition of the network over time, the routing
strategy should dynamically adjust the critical value of flow classification and the
amount of reserved bandwidth. Meanwhile, the flow tables installed in the switches
should be updated as necessary for efficient network usage.

In this module, the update process is triggered if the classification metric or the
reserved bandwidth is out of date, which means that the variation exceeds a threshold
of extent or time. If some flows matched to the first routing algorithm cannot find a
feasible path using the reserved bandwidth, they use the other routing strategy to ensure
temporary forwarding, and the reserved bandwidth should be increased to avoid packet
loss and provide better QoS.

In addition, the traffic forwarding rules in the flow tables are generated by the central
controller and installed in the corresponding OpenFlow-enabled switches. As network
traffic and topology are unpredictable, the flow tables are dynamically updated to remain
consistent with the employed strategy. On the one hand, paths are re-evaluated promptly
whenever a new flow arrives or the network changes. On the other hand, the global
statistical information is updated promptly and the flow tables are set to delete timed-out
entries. We choose a suitable update period and threshold to trade off the consistency
of real-time state information against the overhead of frequent update communication
in the control plane.


3 Simulation and Setting 


To evaluate the proposed SDN-based mechanism, it is implemented on the POX platform,
and the Mininet simulator is used to emulate the network topology. POX is a software
platform for the rapid development and prototyping of network controllers, written in
Python, which can invoke scripts to operate the corresponding modules and derive
adaptive actions. Mininet provides a lightweight test bed for developing OpenFlow
applications and helps users create a realistic virtual network, including a collection of
hosts, Open vSwitches and network links, on a single machine [10, 11].

The whole simulation runs on an experimental computer with the Ubuntu 14.04 operating
system. The computing performance is therefore influenced by the parameters of the
physical machine: an Intel Core i5-4590 CPU @ 3.30 GHz and 4.00 GB of RAM.

In practice, data is commonly delivered to its destination over a multi-hop path in a
complex topology. To simplify the simulation, we conducted experiments on the topology
shown in Fig. 3. In this topology, we set H1 and H2 as clients and H3 and H4 as servers.
Meanwhile, 9 Open vSwitches were built as the OpenFlow-enabled switches, and the
bandwidth of each link was set to 10 Mbps.

Fig. 3. Topology graph of simulation

Under the stationary condition, the flow distribution and the network state do not vary
over time. We conduct experiments under this condition to test the algorithms of our
mechanism. Under the dynamic condition, flow arrivals could follow a distribution such
as Poisson. If the size of a flow is given, its departure time, which depends on the
transmission delay, is determined by the bandwidth demand, which is drawn from a
Normal distribution. This work conducts experiments on Mininet to evaluate our
mechanism and compares the results under a custom dynamic condition, in which UDP
traffic of 5 Mbps is emulated as a background flow from H1 to H3 using the iperf
command. Using a single routing algorithm, the throughput and delay are measured
when sudden flows are generated at some point.
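
The dynamic condition described above, with Poisson arrivals and normally distributed bandwidth demands, could be generated along the following lines. This is only a sketch: the function name, the parameter values and the lower bound of 0.1 Mbps on the demand are assumptions, and the paper's own experiments use a fixed 5 Mbps background flow instead.

```python
import random
from typing import List, Tuple


def generate_flows(n: int, arrival_rate: float, size_mbit: float,
                   mean_bw: float, std_bw: float,
                   seed: int = 0) -> List[Tuple[float, float, float]]:
    """Return n flows as (arrival_time, bandwidth_demand, departure_time).

    Inter-arrival times are exponential (a Poisson arrival process), demands
    are Normal and floored at 0.1 Mbps, and the departure time follows from
    the fixed flow size: departure = arrival + size / demand.
    """
    rng = random.Random(seed)
    flows, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(arrival_rate)
        bw = max(rng.gauss(mean_bw, std_bw), 0.1)
        flows.append((t, bw, t + size_mbit / bw))
    return flows


# 50 flows of 10 Mbit each, 2 arrivals per second, demand ~ N(3, 1) Mbps.
flows = generate_flows(50, arrival_rate=2.0, size_mbit=10.0, mean_bw=3.0, std_bw=1.0)
```

Fixing the seed makes a traffic trace repeatable, so that different routing algorithms can be compared on identical input.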


4 Evaluation and Analysis 


For real-time multimedia applications, QoS metrics such as bandwidth, delay, delay jitter,
and packet loss ratio are usually the major concerns of receivers, while senders care
more about transmitting traffic under high load and high randomness. For network
management, network performance depends on metrics such as average link utilization,
resource consumption, packet loss ratio, throughput, device energy efficiency, and
average end-to-end delay. Therefore, the QoS-aware routing problem can be converted
into a multi-objective programming problem, which is to find a minimum-cost solution
that satisfies some QoS constraints.

In this paper, we focus on two metrics, namely throughput and delay. Under the stationary
condition, intuitively, our classified, synthesized QoS routing mechanism has larger
throughput than BRSP and smaller delay than LBCPP. Under the custom dynamic
condition, in which background flows are added along the shortest path at 50 s and
removed at 110 s, we compare the throughput and delay of the two unclassified single
routing algorithms, as shown in Figs. 4 and 5. When the background flows are injected
into the network, the throughput with BRSP drops markedly, while LBCPP shows a higher
throughput at the same time. Similarly, the latency of BRSP increases more noticeably
than that of LBCPP. Therefore, from H2 to H4, the results show that the proposed routing
mechanism can reduce the delay for prioritized mice flows and increase the throughput
for lower-priority elephant flows. In the future, we will employ other dynamic conditions,
injecting background traffic that follows a Poisson or Normal distribution.


5 Conclusion and Future Work


Based on a fusion of techniques, this work presents an intelligent, perceptive QoS
mechanism for forwarding diverse flows with adaptive routing. The architecture consists
of five modules that collaborate on the SDN controller. First, it finds a feasible
classification of flows according to their bandwidth requirements. Second, for
delay-sensitive flows, the strategy uses bandwidth reservation and finds the shortest
path in order to guarantee the delay demand and reduce the packet loss rate caused by
propagation in the network. Third, for bandwidth-sensitive flows, the load-balancing
multipath polling method is applied to improve network performance. Finally, it updates
an adjustable traffic classification threshold so as to remain optimal for both network
performance and service quality.

In future work, we will employ and explore other computing algorithms to implement the
proposed mechanism. In this paper, the machine learning technique K-means clustering
is used to classify the flows by finding the most suitable dividing metric for the previous
period. Since this 2-means flow classification may not be efficient, other techniques such
as reinforcement learning could be used to find the optimal adaptive dividing metric with
real-time feedback. Moreover, in the updating module, prediction methods could be
exploited to estimate the growth tendency of traffic features for more appropriate
classification. In addition, machine learning could also be applied to controller analysis,
admission control, and packet security.


Abstract. Aiming at the complex large-scale shortest path problem, a modified
genetic learning algorithm is proposed. Firstly, in order to prevent premature
convergence of the evolving population, a new fitness function for the shortest path
is proposed. At the same time, a method for selecting the crossover probability is
proposed to ensure the global search ability of the algorithm, and a method for
selecting the mutation probability is proposed to ensure its local search ability.
The experimental results show that the modified genetic algorithm has a better
success rate than the traditional algorithm.


Keywords: Genetic algorithm - Shortest path - Fitness function -
Crossover operator - Mutation operator


1 Introduction 


The shortest path problem is widely encountered in many fields, such as intelligent
transportation [1], computer networks [2], robot path planning [3, 4] and so on. There are
many deterministic algorithms for the shortest path problem, such as the Dijkstra
algorithm [5] and the Ford algorithm [6]. However, in practical applications, the size of
the shortest path problem is expanding, the number of constraints is increasing, and the
complexity of the influencing factors is also increasing. As a result, the shortest path
problem is becoming increasingly difficult to solve. With traditional algorithms, it is
difficult to find the optimal or a suboptimal solution quickly. For large-scale shortest
path problems, an intelligent algorithm can obtain the optimal or a suboptimal solution
within the user's time and precision conditions.

The genetic algorithm (GA) has good adaptability, robustness and flexibility [7]. In
this paper, we use a genetic algorithm to solve the shortest path problem. However,
genetic algorithms have some problems, such as premature convergence, weak local
search ability, low efficiency and low precision. These problems are caused by the
irrational design of the population size and the fitness function and the irrational
selection of the crossover and mutation probabilities. In order to solve the above
problems, two kinds of fitness function transformation methods are designed, and the
strategies for selecting the crossover probability and mutation probability are
optimized. The effectiveness of the algorithm in dealing with the shortest path
problem is verified by simulation experiments.


2 Genetic Algorithm 


The shortest path genetic algorithm consists of the following five parts: (1) the genetic 
representation of the shortest path problem; (2) the encoding and decoding of the shortest 
path problem; (3) the design of a fitness function that rates the merit of each 
chromosome; (4) the design of the genetic operators that change the genetic structure of 
offspring, usually selection, crossover and mutation; and (5) parameter settings, 
including the population size and the probabilities of applying the genetic operators. 
The keys to solving the shortest path problem with a GA are the design of the fitness 
function and the settings of the crossover and mutation operators. 

The genetic algorithm [8, 9] simulates the process of biological evolution to find an 
optimal solution. Its main advantages are good adaptability, robustness and flexibility. 
First, as an intelligent search algorithm, its optimization process places few 
mathematical requirements on the problem: the algorithm is modeled according to the 
specific problem, and a solution is found by evolution without regard to the problem's 
internal structure, so it adapts well. Second, because it borrows the evolutionary idea 
of biogenetics, it can carry out a global search very efficiently. Third, the 
computational complexity of the genetic algorithm is small compared with the traditional 
algorithm. At the same time, the genetic algorithm can be hybridized with other related 
heuristic algorithms, which makes it flexible in dealing with practical problems. 

Fig. 1. Genetic algorithm flow chart (generate the initial population; calculate the 
fitness value of each individual; while the specified generation or precision has not 
been reached, execute the selection, crossover and mutation operations and recalculate 
the fitness values; otherwise, get results) 

The basic process of the genetic algorithm is shown in Fig. 1. It comprises population 
initialization, selection, crossover and mutation. Through these operations, individuals 
evolve in the direction of higher fitness until an optimal or suboptimal solution is 
found [10]. 
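
The loop in Fig. 1 can be sketched in Python as follows. This is a minimal, generic illustration and not the implementation used in this paper: the bit-string toy problem, the function name `run_ga`, binary tournament selection, one-point crossover and all parameter values are assumptions chosen only to keep the example short.

```python
import random

def run_ga(fitness, n_genes, pop_size=30, generations=60,
           p_cross=0.8, p_mut=0.02, seed=0):
    """Generic GA loop: evaluate, select, cross over, mutate, repeat."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]
    for _ in range(generations):
        # Selection: binary tournament (an assumed choice, for brevity).
        def pick():
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            # Crossover: one-point, applied with probability p_cross.
            if rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Mutation: flip each bit with probability p_mut.
            for child in (p1, p2):
                for k in range(n_genes):
                    if rng.random() < p_mut:
                        child[k] = 1 - child[k]
                children.append(child)
        pop = children[:pop_size]
        # Keep track of the best individual seen so far.
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best[:]
    return best

# Toy "one-max" problem: the fitness is the number of 1s in the chromosome.
best = run_ga(sum, n_genes=16)
print(best, sum(best))
```

The stopping rule here is a fixed number of generations; a precision-based rule, as in Fig. 1, would replace the `for` loop with a check on the best fitness.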


3 Optimization Design of Genetic Algorithm 


3.1 Genetic Representation 


Coding is a key factor in the efficiency and success rate of genetic algorithms. 
In this paper, we use the priority coding method to represent the shortest path problem. 
Its advantages are that every encoding corresponds to a path, that most crossover 
operations are easy to implement, and that no loop arises after crossover. 

Encoding: A random sequence is generated to form the initial chromosome. Its length 
is the total number of nodes in the graph. Following the encoding pseudo code, we 
obtain a random sequence, which is called a “chromosome”. The node ID in Fig. 2 is 
a random sequence containing 11 nodes. The encoding pseudo code is given in 
Table 1. 


Table 1. Priority encoding pseudo code 
Encoding Pseudo Code: 
for (i = 1 to n) 
    chromosome[i] = i; 
for (i = 1 to n/2) 
    repeat 
        q = random(1, n); 
        p = random(1, n); 
    until p != q; 
    swap(chromosome[p], chromosome[q]); 
end 
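
The encoding step in Table 1 could be written in Python roughly as follows. The function name `random_priority_chromosome` and the seed are assumptions made for illustration; positions are 0-based here, whereas the pseudo code is 1-based.

```python
import random

def random_priority_chromosome(n, rng=random):
    """Return a random permutation of 1..n, built as in Table 1."""
    chromosome = list(range(1, n + 1))       # chromosome[i] = i
    for _ in range(n // 2):                  # n/2 random swaps
        p, q = rng.sample(range(n), 2)       # two distinct positions (p != q)
        chromosome[p], chromosome[q] = chromosome[q], chromosome[p]
    return chromosome

# An 11-gene chromosome, matching the number of nodes mentioned for Fig. 2.
chrom = random_priority_chromosome(11, random.Random(42))
print(chrom)
```

Because the procedure only swaps entries, every chromosome is a permutation of the priorities 1..n, so each node receives a distinct priority.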
Fig. 2. Priority decoding and encoding process 


Decoding: The decoding process is similar to a depth-first traversal. Take Fig. 2 as 
an example. Firstly, according to the topology of the graph, the successor nodes of 
node 1 are node 2, node 3 and node 4. Their values in the node ID are ID[2] = 1, 
ID[3] = 10 and ID[4] = 3. Secondly, the successor of node 1 with the maximum value is 
chosen, so the path becomes 1-3. Thirdly, the first two steps are repeated from the 
newly added node until a complete path is found. In this way we obtain the path 
1-3-6-5-8-11 from the node ID. The decoding pseudo code is given in Table 2. 


Table 2. Priority decoding pseudo code 
Decoding Pseudo Code: 
Initialize i=1; l=1; path[l]=1; 
while S_1 != Ø 
    temp_j = argmax(chromosome[j], j ∈ S_i); 
    if (chromosome[temp_j] != 0) 
        path[l] = temp_j; 
        l = l+1; 
        chromosome[temp_j] = 0; 
        i = temp_j; 
    else 
        S_i = S_i \ temp_j; // delete temp_j from S_i 
        chromosome[i] = 0; 
        if (l < 1) l = 1; break; 
        i = chromosome[l]; 
end 
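
The decoding idea can be illustrated with a short Python sketch. It is an assumption-laden simplification, not the procedure of Table 2 line by line: visited nodes are tracked in a set instead of zeroing priorities, the function name `decode` is hypothetical, and the 6-node graph below is invented for the example (it is not the graph of Fig. 2).

```python
def decode(priority, succ, source, target):
    """Grow a path by always moving to the unvisited successor with the
    highest priority, backtracking from dead ends (depth-first search)."""
    path, visited = [source], {source}
    while path and path[-1] != target:
        node = path[-1]
        candidates = [v for v in succ.get(node, []) if v not in visited]
        if candidates:
            nxt = max(candidates, key=lambda v: priority[v])
            visited.add(nxt)
            path.append(nxt)
        else:
            path.pop()           # dead end: step back to the previous node
    return path                  # empty list if the target is unreachable

# Hypothetical graph: succ maps each node to its successor nodes.
succ = {1: [2, 3, 4], 2: [5], 3: [5, 6], 4: [6], 5: [6], 6: []}
priority = {1: 2, 2: 1, 3: 6, 4: 3, 5: 5, 6: 4}
print(decode(priority, succ, source=1, target=6))   # [1, 3, 5, 6]
```

From node 1 the successor with the highest priority is node 3, then node 5, then node 6, which mirrors the way the path 1-3 is chosen in the text.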


3.2 Fitness Function 


The fitness function determines the direction of evolution in a GA and strongly 
affects its convergence speed. In this paper, the shortest path problem with designated 
points is encoded by the priority coding method, so every chromosome can be decoded 
into an actual path. A path is a correct solution when the number of designated points 
it contains meets the requirement. Therefore, we set the fitness function to the number 
of designated points in the path, so that evolution proceeds toward paths that contain 
more and more designated points. Because this fitness function does not take the weight 
of the path into account, the feasible solutions found may have a large weight. To 
compensate for this defect, the algorithm compares the feasible solutions it finds and 
chooses the best of them as the global optimal solution. 


3.2.1 Exponential Transformation 

$$f'(i) = e^{-\alpha f(i)}, \qquad \alpha = \frac{\sqrt[m]{t}}{f_{avg} + \varepsilon}, \qquad m = 1 + \lg(T) \tag{1}$$

where $f(i)$ represents the original fitness value of the $i$th chromosome, $f'(i)$ 
represents its new fitness value after the exponential transformation, $\alpha$ is the 
exponential transformation coefficient, a positive number that changes dynamically and 
gradually increases with the evolutionary generation, $t$ is the current evolutionary 
generation, $T$ is the maximum number of generations, $f_{avg}$ is the average of 
$f(i)$ over the population, and $\varepsilon$ is a sufficiently small positive number. 

From this definition, $f'(i)$ is a non-negative number. In the early stage of 
evolution, $f_{avg}$ is very large and $\alpha$ is small. As the population evolves, 
$f_{avg}$ gradually decreases and the current generation $t$ increases, so $\alpha$ is 
guaranteed to increase gradually. Therefore, the fitness function is an adaptive, 
dynamically adjusted function. 
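
A small numerical sketch may help to see how the transformation sharpens selection pressure over time. It assumes the reading f'(i) = exp(-α f(i)) with α = t^(1/m) / (f_avg + ε) and m = 1 + lg(T); this reading of Eq. (1), the function name `exponential_fitness` and the sample values are assumptions for illustration.

```python
import math

def exponential_fitness(raw, t, T, eps=1e-9):
    """Map raw fitness values (smaller is better) to exp(-alpha * f)."""
    f_avg = sum(raw) / len(raw)
    m = 1 + math.log10(T)
    alpha = t ** (1.0 / m) / (f_avg + eps)   # grows with the generation t
    return [math.exp(-alpha * f) for f in raw]

raw = [10.0, 20.0, 30.0]                     # hypothetical raw fitness values
early = exponential_fitness(raw, t=1, T=100)
late = exponential_fitness(raw, t=100, T=100)
print(early)
print(late)
```

With these values the ratio between the best and worst transformed fitness is about e early on and far larger in the last generation, so good chromosomes are favoured more strongly as evolution proceeds.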


3.2.2 Logistic Curve Function Transformation Method 


The logistic curve function is $y = \frac{1}{1 + e^{x}}$. According to the logistic 
curve in Fig. 3, the range of the function is between 0 and 1. The function value lies 
between 0.5 and 1 when x is less than 0, and between 0 and 0.5 when x is larger than 0. 

Fig. 3. Logistic curve 


As can be seen from Fig. 3, the logistic function takes a value close to either 1 or 0 
when x lies outside [−10, 10]. If the logistic function were used directly as the 
fitness function, the fitness values of many chromosomes would be close to 1 or 0, so 
that the majority of chromosomes in the population would be less competitive, which 
leads to premature convergence. To avoid premature convergence, a new fitness function 
is designed on the basis of the logistic function. 


$$f^{*}(i) = \begin{cases} \dfrac{1}{1 + \exp\left(\dfrac{f(i) - f_{avg}}{c}\right)}, & g > 30\%\,n \\[2ex] \dfrac{1}{1 + \exp\left(f(i) - f_{avg}\right)}, & g \le 30\%\,n \end{cases} \tag{2}$$

where $f^{*}(i)$ is the new fitness value of the $i$th chromosome obtained by the 
logistic function transformation, $f(i)$ is the original fitness value of the $i$th 
chromosome, $f_{avg}$ is the average value of $f(i)$ over the current population, $g$ 
is the number of chromosomes whose difference $f(i) - f_{avg}$ lies outside the range 
(−10, 10), and $c$ is the order of magnitude of the maximum value of 
$|f(i) - f_{avg}|$. 

Both fitness functions satisfy the necessary conditions of a general fitness function. 
In addition, compared with other fitness functions, they have the following 
properties: (1) the new fitness value is always non-negative, so the objective function 
can be used directly as the original fitness function regardless of its sign; (2) the 
new fitness value decreases as the original fitness value increases, that is, the 
greater the original fitness value, the smaller the new fitness value; (3) most 
chromosomes in the population remain highly competitive, so the probability of 
premature convergence is reduced. 
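
The logistic transformation can be sketched as follows. The sketch assumes that the difference f(i) − f_avg is divided by c (a power of ten) when more than 30% of the chromosomes fall outside (−10, 10), and that n is the population size; both points, the function name `logistic_fitness` and the sample values are assumptions made for illustration.

```python
import math

def logistic_fitness(raw, limit=10.0, share=0.3):
    """Logistic transform of raw fitness values (smaller raw value is better)."""
    f_avg = sum(raw) / len(raw)
    diffs = [f - f_avg for f in raw]
    g = sum(1 for d in diffs if abs(d) > limit)    # values outside (-10, 10)
    scale = 1.0
    if g > share * len(raw):
        biggest = max(abs(d) for d in diffs)
        scale = 10.0 ** math.floor(math.log10(biggest))   # order of magnitude c
    # The exponent is capped only to avoid floating-point overflow.
    return [1.0 / (1.0 + math.exp(min(d / scale, 700.0))) for d in diffs]

print(logistic_fitness([100.0, 400.0, 700.0]))
```

Without the scaling, the three sample values would map to almost exactly 1, 0.5 and 0; with it they stay clearly separated, which is the effect described in property (3).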


3.3 Crossover Operator 


In the genetic algorithm, the crossover operator has a great effect on the global 
search ability and the convergence ability. In this paper, we present a new method for 
setting the crossover probability. Firstly, to prevent the genes of excellent 
individuals from being destroyed, excellent individuals are copied directly into the 
next generation. We take the first 30% of chromosomes in the population to be the 
excellent individuals. Secondly, to reduce invalid crossover, the remaining chromosomes 
adopt a non-equal probability pairing strategy: the correlation function in Formula 3 
is used to calculate the correlation between chromosomes, and a higher crossover 
probability is assigned to pairs of chromosomes with low correlation. 

First, we select a chromosome X that has not yet undergone crossover and calculate 
the correlation index between X and each of the other chromosomes that have not yet 
been crossed. From the correlation index, the probability of each such chromosome being 
crossed with X is calculated. Roulette wheel selection is then used to choose the 
chromosome that crosses with X. 


a. Correlation index function 

$$r(X, Y_i) = \sum_{j=1}^{m} x_j \oplus y_j, \qquad x_j \oplus y_j = \begin{cases} 0, & x_j = y_j \\ 1, & x_j \ne y_j \end{cases} \tag{3}$$

where $Y_i$ is a chromosome that has not yet been crossed, $x_j$ and $y_j$ are the 
$j$th genes of $X$ and $Y_i$, $m$ is the number of genes in the longer of the two 
chromosomes, and $r(X, Y_i)$ is the number of genes that differ between $X$ and $Y_i$. 
The greater $r(X, Y_i)$ is, the smaller the correlation between $X$ and $Y_i$, and the 
smaller the probability of an invalid crossover operation. 


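b. The crossover probability of chromosome $Y_i$ being selected to cross with chromosome $X$ 

$$p(Y_i) = \frac{1}{m}\left(1 + \lambda\,\frac{r(X, Y_i) - r_{avg}}{r_{max} - r_{avg}}\right) \tag{4}$$

where $r_{avg} = \frac{1}{m}\sum_{i=1}^{m} r(X, Y_i)$, 
$r_{max} = \max\{r(X, Y_i),\ i = 1, 2, \ldots, m\}$, 
$r_{min} = \min\{r(X, Y_i),\ i = 1, 2, \ldots, m\}$, $\lambda$ is a constant, and 
$0 < \lambda < 1$. 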
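
The pairing strategy can be sketched in Python under stated assumptions: the correlation index is taken to be the number of differing gene positions, the pairing probability is taken to be (1/m)(1 + λ (r − r_avg) / (r_max − r_avg)) with m the number of candidate chromosomes, and equal-length chromosomes are assumed. The function names, the guard against negative weights and the sample chromosomes are additions made for illustration only.

```python
import random

def correlation_index(x, y):
    """Number of gene positions at which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def pairing_probabilities(x, candidates, lam=0.5):
    """Probability of each candidate being chosen to cross with x."""
    r = [correlation_index(x, y) for y in candidates]
    m = len(r)
    r_avg, r_max = sum(r) / m, max(r)
    if r_max == r_avg:                        # all candidates differ equally
        return [1.0 / m] * m
    p = [(1.0 + lam * (ri - r_avg) / (r_max - r_avg)) / m for ri in r]
    p = [max(v, 0.0) for v in p]              # guard (an addition): no negative weights
    total = sum(p)
    return [v / total for v in p]

def pick_partner(x, candidates, rng, lam=0.5):
    """Roulette wheel selection of the chromosome that crosses with x."""
    weights = pairing_probabilities(x, candidates, lam)
    return rng.choices(candidates, weights=weights, k=1)[0]

x = [1, 2, 3, 4]
candidates = [[1, 2, 3, 4], [1, 2, 4, 3], [4, 3, 2, 1]]
print(pairing_probabilities(x, candidates))
print(pick_partner(x, candidates, random.Random(0)))
```

In this example the candidate identical to X receives the smallest probability (1/6) and the most dissimilar one the largest (1/2), so crossover between near-identical chromosomes, which would change little, becomes less likely.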


3.4 Mutation Operator 


The mutation operator is an auxiliary method that affects the local search ability of 
the genetic algorithm. Therefore, the algorithm performs mutation on only a small 
fraction of the chromosomes. Formula 5 is used to calculate the mutation probability 
of each chromosome. We sort the individuals in ascending order of mutation probability 
and select the first 15% of individuals for mutation; that is, the mutation probability 
is set to 15%. 


$$p_m = \begin{cases} p_{m1}, & f < f_{avg} \\[1ex] p_{m1} - \dfrac{(p_{m1} - p_{m2})\,k_1\,(f_{max} - f)}{f_{max} - f_{avg}}, & f \ge f_{avg} \end{cases} \tag{5}$$

where $p_{m1} = 0.1$, $p_{m2} = 0.001$, $k_1$ is a constant with $0 < k_1 < 1$, 
$f_{max}$ is the largest fitness value in the population, $f_{avg}$ is the average 
fitness value of the population, and $f$ is the new fitness value of the chromosome. 
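
As a rough illustration, the adaptive mutation probability and the 15% selection step could be coded as below. The sketch assumes the piecewise reading p_m = p_m1 for f < f_avg and p_m = p_m1 − (p_m1 − p_m2) k_1 (f_max − f) / (f_max − f_avg) otherwise; that reading, the value k_1 = 0.5 and the function names are assumptions made for illustration.

```python
def mutation_probability(f, f_avg, f_max, p_m1=0.1, p_m2=0.001, k1=0.5):
    """Adaptive mutation probability of one chromosome with fitness f."""
    if f < f_avg or f_max == f_avg:          # second test avoids division by zero
        return p_m1
    return p_m1 - (p_m1 - p_m2) * k1 * (f_max - f) / (f_max - f_avg)

def select_for_mutation(fitnesses, share=0.15, **kw):
    """Indices of the chromosomes chosen for mutation."""
    f_avg = sum(fitnesses) / len(fitnesses)
    f_max = max(fitnesses)
    probs = [mutation_probability(f, f_avg, f_max, **kw) for f in fitnesses]
    # Ascending order of mutation probability, as stated in the text.
    order = sorted(range(len(fitnesses)), key=lambda i: probs[i])
    return order[:max(1, round(share * len(fitnesses)))]

print(mutation_probability(2.0, 2.0, 4.0))      # chromosome at the average
print(select_for_mutation(list(range(20))))     # 3 of 20 chromosomes
```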


4 Experimental Analysis 


4.1 The Influence of the Crossover Operator 


To verify the influence of the crossover operator, the improved genetic algorithm 
with the new crossover operator is compared with the traditional genetic algorithm. 
The results in Table 3 are averages over 100 tests. Table 3 shows that the new 
crossover operator improves the accuracy of the genetic algorithm and effectively 
keeps the algorithm from becoming trapped in a local optimum. 


Table 3. Performance comparison of improved GA and traditional GA 

Algorithm      | Average convergence generations | Optimal solution ratio 
Traditional GA | 78                              | 0.61 
Improved GA    | 42                              | 0.94 







4.2 The Modified Genetic Algorithm Analysis 


To verify the validity of the algorithm, the Dijkstra algorithm is compared with 
the improved genetic algorithm. In Table 5, the optimal and suboptimal solution values 
are averages over the successful cases, and the time is the average time over the 
cases. The genetic algorithm parameters are set as follows: the population size is 
about 1000, fluctuating slightly between cases, and the mutation probability is 15%. 

As can be seen from Tables 4 and 5, the genetic algorithm has a better success rate 
than the Dijkstra algorithm and is more effective on large-scale data. In theory, the 
probability that the genetic algorithm finds the optimal solution is 100%; in practice, 
however, it is difficult to maintain individual diversity in the population, so the 
success rate cannot reach 100%. 


Table 4. Dijkstra algorithm simulation results 

Nodes | Number of designated points | Number of successes/number of cases | Traversal completion time 
20    | ?                           | ?                                   | < ? ms 
50    | ?                           | ?                                   | 3 ms 
100   | 15                          | ?                                   | 31 ms 
200   | 15                          | 8/10                                | 90 ms 
300   | 20                          | 7/10                                | 320 ms 
Table 5. Genetic algorithm simulation results 

Nodes | Suboptimal solution/optimal solution | Suboptimal solution time/optimal solution time 
20    | 1/2                                  | 11 ms/11 ms 
50    | 1/3                                  | 52 ms/63 ms 
100   | 3/23                                 | 325 ms/3 s 
200   | 6/41                                 | 325 ms/36 s 
300   | 21/83                                | 42 s/203 s 
500   | 56/127                               | 167 s/361 s 


The experimental results show that the genetic algorithm has a high computational 
cost compared with the traditional algorithm. The improved genetic algorithm is much 
more expensive than the Dijkstra algorithm, especially when searching for the optimal 
solution. To address this problem, a pruning strategy can be adopted: the graph is 
first pruned to reduce its size, and the genetic algorithm proposed in this paper is 
then applied to the pruned graph. In this way, the computational cost is reduced while 
the search ability of the algorithm is preserved. 


5 Conclusion 


The shortest path problem is an active research area. In this paper, we modify the 
fitness function of the genetic algorithm and design new methods for selecting the 
crossover probability and the mutation probability. These improvements give the genetic 
algorithm a good success rate. In addition, the proposed algorithm can easily be 
combined with a pruning algorithm, which reduces the computational cost while 
preserving the search ability on the shortest path problem. However, this paper 
considers only the constraint of designated points, so the algorithm still has room 
for improvement. 
Abstract. To enable device compatibility, interoperability and integration 
in the Internet of Things (IoT), several ontological frameworks have 
been developed using Semantic Web technologies, a common and widely 
adopted toolkit for addressing heterogeneity issues in complex IT 
systems. These ontologies aim to provide a common vocabulary of terms 
to be universally adopted by the IoT community. Although they are defined 
using the Web Ontology Language (OWL), a language underpinned by 
Description Logics, these vocabularies seem to neglect the automated 
reasoning support that comes with this semantic approach to modelling 
IoT environments. To bridge this gap, this paper builds upon existing 
work on semantic modelling for the IoT and proposes using IoT ontologies 
to define and enforce policies, thus benefiting from the built-in support 
for automated reasoning. 
1 Introduction 


While challenges associated with timely processing of sensor data have been 
relatively successfully tackled by the advances in networking and hardware tech- 
nologies, the challenge of properly handling data representation and semantics of 
IoT descriptions and sensor observations is still pressing. In the presence of mul- 
tiple organisations for standardisation, as well as various hardware and software 
vendors, overcoming the resulting heterogeneity remains one of the major con- 
cerns for the IoT community. Moreover, apart from the syntactic heterogeneity 
(i.e. heterogeneity in the data representation, such as, for example, differences in 
data formats/encodings), we can also distinguish heterogeneity in the semantics 
of the data [2]. For example, different units of measurement and metric systems 
are common causes for incompatibility and integrity problems. 
As a potential workaround to the IoT heterogeneity problem and yet-to-come 
standards, the research community started investigating potential ways 
of creating semantic modelling languages, which would provide an overarching 
modelling framework and bridge formerly-disjoint heterogeneous IoT systems. 
Such modelling languages can be seen as common vocabularies of terms, which 
are expected to be used by IoT practitioners to enable compatibility and 
interoperability when integrating IoT solutions. Simply put, two disjoint systems 
would be able to ‘communicate’ with each other by expressing their data and 
interfaces using a common, suitable modelling language. A representative example of how 
this lack of unified data representation has been addressed in the context of 
the IoT is the Semantic Sensor Web — a promising combination of two research 
areas, the Semantic Web and the Sensor Web [6]. Using the Semantic Web 
technology stack to represent data in a uniform and homogeneous manner, it 
provides enhanced meaning for sensor descriptions and observations so as to 
facilitate data analysis and situation awareness [3]. One of the main outcomes 
of this initiative was the Semantic Sensor Network (SSN) ontology — a thorough 
modelling vocabulary, jointly developed by a wide group of researchers. 

From this perspective, however, Semantic Web ontologies do not differ much 
from other modelling approaches (e.g. UML or XML) that provide a taxonomy of 
terms and relationships to be used as ‘building blocks’ when describing the IoT 
domain. A major advantage of the Semantic Web languages, which is frequently 
neglected in this context, is the support for automated formal analysis, 
underpinned by the built-in reasoning capabilities of the Web Ontology Language 
(OWL), one of the key enabling technologies of the Semantic Web [4]. 
By representing data in terms of OWL classes and properties, one can perform 
reasoning tasks over this data and benefit from an existing, highly optimised 
and reliable analysis mechanism. In this light, this paper taps into this idle 
potential for automated reasoning and presents an approach to policy management 
and enforcement in the IoT context, using existing IoT ontologies and the 
corresponding reasoning support. As described below, the proposed approach is 
expected to benefit from separation of concerns, extensibility, human readability, 
as well as increased reliability, automation, and opportunities for reuse. 








2 Background: Semantic Web Languages 


The Semantic Web [1] is the extension of the World Wide Web that enables peo- 
ple to share content beyond the boundaries of applications and websites. This 
is typically achieved through the inclusion of semantic content in web pages, 
which thereby converts the existing Web, dominated by unstructured and semi- 
structured documents, into a web of meaningful machine-readable information. 
The Semantic Web is realised through the combination of certain key technologies [4]; 
the presented research focuses specifically on the Web Ontology 
Language (OWL) and the Semantic Web Rule Language (SWRL) as two 
potential ways of defining and enforcing policies in the context of the IoT. 
OWL is a family of knowledge representation languages used to formally 
define an ontology, that is, “a formal, explicit specification of a shared 
conceptualisation” [7]. Typically, an ontology is seen as a combination of a terminology 
component (i.e. TBox) and an assertion component (i.e. ABox), which are used 
to describe two different types of statements in ontologies. The TBox contains 
definitions of classes and properties, whereas the ABox contains definitions of 
instances of those classes. Together, the TBox and the ABox constitute the 
knowledge base of an ontology. OWL provides advanced constructs to describe 
resources on the Semantic Web. This way, it is possible to explicitly and formally 
define knowledge (i.e. concepts, relations, properties, instances, etc.) and basic 
rules in order to reason about this knowledge. OWL allows stating additional 
constraints, such as cardinality, restrictions of values, or characteristics of prop- 
erties such as transitivity. OWL languages are characterised by formal semantics 
— they are based on Description Logics (DLs) and thus bring reasoning power 
to the Semantic Web. SWRL extends OWL with even more expressiveness, as 
it allows defining rules in the form of implication between an antecedent (body) 
and consequent (head). It means that whenever the conditions specified in the 
body of a rule hold, then the conditions specified in the head must also hold. 

To date, a number of IoT ontologies have been developed using OWL. More 
specifically, the IoT-Lite ontology¹ is a lightweight instantiation of the SSN 
ontology, actively developed by the World Wide Web Consortium. It describes the key 
IoT concepts to allow interoperability and discovery of sensory data in heteroge- 
neous IoT platforms. This ontology reduces the complexity of other IoT models 
by describing only the main concepts of the IoT domain. This means that fol- 
lowing the Semantic Web principles of linking and reusing existing ontologies 
and datasets, it is possible to extend the core vocabulary with other relevant 
concepts, defined in other ontologies if/when needed. This way, ontology engi- 
neers can simply import an existing, established, and trusted ontology, instead 
of ‘re-inventing the wheel’. 





3 Sample Scenario 


The following use case scenario demonstrates the proposed idea of leveraging 
the idle potential of OWL ontologies and focuses on a complex IoT system com- 
posed of multiple sensing devices, deployed both indoors and outdoors. Some 
of these devices are temperature sensors installed in rooms within a building. 
It is assumed that whenever any of these temperature sensors indicates a value 
exceeding a dangerous level of 50°, the situation has to be classified as critical 
and thus requires reactive action. A possible way to handle this scenario 
would be to define explicit policies for every single temperature sensor within 
the building. In the worst case, such policies would be either ‘hard-coded’ with 
numerous if/then operators (i.e. any modifications would lead to the source 
code recompilation), or defined declaratively (i.e. stored in some kind of con- 
figuration file to be dynamically fetched by the analysis component). In both 
cases, however, the resulting knowledge base would be saturated by the excessive 








¹ https://www.w3.org/Submission/iot-lite/. 
number of hard-to-manage and possibly conflicting policies. An alternative 
solution is to use the reasoning capabilities of OWL and SWRL to classify 
observed IoT data as instances of specific classes. More specifically, with this 
use case we demonstrate three different types of automated classification:² 

— Defined classes: underpinned by the DLs, OWL allows creating so-called 
defined classes via a set of necessary and sufficient conditions. This means that 
the reasoner will classify any entity with a required set of sufficient properties 
as an instance of a specific class, even if this class membership was not defined 
explicitly. The defined class RoomDevice is demonstrated in Listing 1.1, which 
should be read as “if sd is a sensing device and has a rectangular coverage area, 
then sd is a device installed in a room”. 





Listing 1.1. Defined OWL class RoomDevice. 


ssn:SensingDevice(?sd) AND iot:Rectangle(?r) 
AND iot:hasCoverage(?sd,?r) ⇒ 
iot:RoomDevice(?sd) 


— OWL subclasses: OWL also provides a simpler and more explicit way of 
defining classes and subclass relationships. It supports multiple and transitive 
inheritance, and, as in many programming languages, subclasses inherit 
all the properties of their parent classes. The code snippet in Listing 1.2 contains 
two definitions. The first definition simply states that any temperature sensor 
is a sensing device. The second definition illustrates the transitive inheritance 
through a subclass hierarchy that states — in simple words — that if a device is 
installed in a room, then it is automatically assumed to be installed in a building 
as well, which in turn means it is an indoor device. 


Listing 1.2. Defining OWL subclass relationships. 


iot:TemperatureSensor IS A ssn:SensingDevice 
iot:RoomDevice IS A iot:BuildingDevice IS A iot:IndoorDevice 


— SWRL classification: SWRL allows defining more expressive rules, as illus- 
trated by Listing 1.3. In simple words, the code snippet reads that if there is 
an indoor device id, indicating that its measured value has exceeded 50°, the 
current observation has to be classified as critical. 


Listing 1.3. Defining the class CriticalObservation using SWRL. 


iot:IndoorDevice(?id) AND ssn:Observation(?o) AND dul:Value(?v) 
AND iot:observes(?id,?o) AND ssn:hasValue(?o, ?v) 
AND swrl:greaterThan(?v, 50) THEN 

iot:CriticalObservation(?o) 


² The notations ssn, iot, and dul are established shortcuts for imported OWL 
ontologies, where corresponding concepts are defined. 
SSN ontology (ssn): http://purl.oclc.org/NET/ssnx/ssn. 
IoT-Lite ontology (iot): http://purl.oclc.org/NET/UNIS/fiware/iot-lite. 
DOLCE Ontology (dul): http://loa.istc.cnr.it/ontologies/DOLCE-Lite.owl. 
Taking all three definitions together, we now assume that there is a temperature 
sensor ts reporting a temperature level of 60° in its covered rectangular area. 
The automated reasoner will follow the steps below: 


1. Since TemperatureSensor is a subclass of SensingDevice, ts is classified as 
a SensingDevice (Listing 1.2). 

2. Since ts is a SensingDevice and has a rectangular coverage area, it is clas- 
sified as a RoomDevice (Listing 1.1). 

3. Since RoomDevice is a subclass of BuildingDevice, which is a subclass of 
IndoorDevice, ts is classified as an instance of IndoorDevice (Listing 1.2). 

4. Since the measured observation of an IndoorDevice ts is greater than 50°, 
this observation is classified as CriticalObservation (Listing 1.3). 
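
The four inference steps above can be imitated with a few lines of plain Python. This is only a toy forward-chaining sketch for illustration, not an OWL/SWRL reasoner and not part of the proposed approach: the class names come from Listings 1.1 to 1.3, while the function `infer`, its parameters and the dictionary `SUBCLASS_OF` are hypothetical.

```python
# Subclass axioms from Listing 1.2 (child class -> parent class).
SUBCLASS_OF = {
    "TemperatureSensor": "SensingDevice",
    "RoomDevice": "BuildingDevice",
    "BuildingDevice": "IndoorDevice",
}

def infer(types, coverage_shape, value, threshold=50):
    """Apply all rules repeatedly until no new fact can be derived.
    Returns (classes of the device, classes of its observation)."""
    device = set(types)
    observation = {"Observation"}
    changed = True
    while changed:
        before = (len(device), len(observation))
        # Subclass axioms: an instance of a class is an instance of its parent.
        device |= {SUBCLASS_OF[c] for c in device if c in SUBCLASS_OF}
        # Defined class (Listing 1.1): sensing device with rectangular coverage.
        if "SensingDevice" in device and coverage_shape == "Rectangle":
            device.add("RoomDevice")
        # SWRL rule (Listing 1.3): indoor device whose value exceeds the threshold.
        if "IndoorDevice" in device and value > threshold:
            observation.add("CriticalObservation")
        changed = before != (len(device), len(observation))
    return device, observation

device, observation = infer({"TemperatureSensor"}, "Rectangle", 60)
print(sorted(device))
print(sorted(observation))
```

The only explicit facts are that ts is a TemperatureSensor, that its coverage is a rectangle and that it reports 60°; everything else, including the CriticalObservation classification, is derived by the rules.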








This way, the reasoning engine is able to detect a critical situation by inferring 
implicit facts from the limited explicit information. It is worth noting that using 
generic rules for a wide range of devices, as in the example above, does not 
affect the flexibility of the proposed approach and its ability to define 
fine-grained, targeted policies for individual devices. Apart from inheritance, OWL 
also supports overriding parent properties in subclasses. This means that it 
is possible to enforce device-specific policies, which will override the default 
behaviour and apply only to those specific devices. This way, flexible policy 
enforcement at various granularity levels can be achieved. 





4 Discussion and Conclusion 





This paper discusses the idle potential of existing IoT ontologies 
to be used in the context of policy enforcement mechanisms. The Semantic Web is 
underpinned by Description Logics, which offer automated reasoning support. 
This means that IoT engineers can use existing ontological classes and properties 
to define policies, and, as a result, benefit from the built-in policy enforcement 
mechanisms. More specifically, the following benefits can be identified [2]: 

— Separation of concerns: desirably, a policy enforcement mechanism is expected 
to (i) separate the knowledge base and policies from the actual enforcement of 
these policies, and (ii) allow the definition of the knowledge base in a declarative, 
loosely-coupled manner. The presented approach addresses these requirements 
and enables policy modifications ‘on the fly’ in a seamless, transparent manner 
— that is, without recompiling, redeploying and restarting the whole software 
system. In particular, to apply changes, it is only required to add corresponding 
OWL/SWRL rules to the policy base. As opposed to the existing (potentially 
hard-coded) approaches, the declarative approach is seen as a considerable ben- 
efit, which enables minimum interruptions caused by potential modifications. 
— Human readability and ease of use: the Semantic Web research targets at 
making information on the Web both human- and machine-readable, with 
languages characterised by an easy-to-understand syntax, as well as 
visual editors for effortless and straightforward knowledge engineering. OWL 





188 R. Dautov and S. Distefano 


ontologies are known to be used in a wide range of scientific domains (for example, 
see [5] for an overview of biomedical ontologies) that are not necessarily 
closely connected to computer science, and they allow even non-professional 
programmers (i.e. domain specialists) to design policies. 

— Extensibility: IoT systems may be composed of an extreme number of smart 
devices spread over a large area (e.g. traffic sensors distributed across a city- 
wide road network) and have the capacity to be easily extended (as modern 
cities continue to grow in size, more and more sensors are being deployed to 
support their associated traffic surveillance requirements). To address this 
scalability issue, the proposed semantic approach, using the declarative definition, 
can extend the knowledge base to integrate newly-added devices in a seamless, 
transparent, and non-blocking manner. The same applies to the reverse process 
— once old services are retired and do not need to be considered anymore, the 
corresponding policies can be seamlessly removed from the knowledge base, so 
as not to overload the reasoning processes. 

- Increase in reuse, automation and reliability: policy enforcement mechanisms
already exist in the form of automated reasoners for OWL/SWRL languages,
and the proposed approach aims to build on these capabilities. Since the
reasoning process is automated and performed by an existing reasoning engine,
it is expected to be free from so-called ‘human factors’ and more reliable,
assuming, of course, the validity of ontologies and policies. Arguably, as the
policy base grows in size and complexity, its accurate and prompt maintenance
becomes a pressing concern so as to avoid potential policy conflicts.








References 


1. Berners-Lee, T., Hendler, J., Lassila, O., et al.: The semantic web. Sci. Am. 284(5), 
28-37 (2001) 

2. Dautov, R., Kourtesis, D., Paraskakis, I., Stannett, M.: Addressing self-management 
in cloud platforms: a semantic sensor web approach. In: Proceedings of the 2013 
International Workshop on Hot topics in Cloud Services, pp. 11-18. ACM (2013) 

3. Dautov, R., Paraskakis, I., Stannett, M.: Towards a framework for monitoring cloud 
application platforms as sensor networks. Cluster Comput. 17(4), 1203-1213 (2014) 

4. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies.
Chapman & Hall/CRC, Boca Raton (2009)

5. Rubin, D.L., Shah, N.H., Noy, N.F.: Biomedical ontologies: a functional perspective.
Briefings Bioinf. 9(1), 75-90 (2008)

6. Sheth, A., Henson, C., Sahoo, S.S.: Semantic sensor web. IEEE Internet Comput.
12(4), 78-83 (2008)

7. Studer, R., Benjamins, V.R., Fensel, D.: Knowledge engineering: principles and
methods. Data Knowl. Eng. 25(1-2), 161-197 (1998)





Network Traffic Prediction Based on Wavelet Transform 
and Genetic Algorithm 


Xuehui Zhao, Wanbo Zheng, Lei Ding, and Xingang Zhang


College of Computer Science and Technology, Jilin University, Changchun, China 
442014097@qq.com, zwb@jlu.edu.cn 


Abstract. Traditional network traffic prediction relies on linear models, which
cannot describe changes in network traffic accurately and therefore yield low
prediction accuracy. This paper proposes a new network traffic prediction model
based on the wavelet transform and a genetic algorithm. First, wavelet
decomposition turns the network traffic into several stable components. Second,
a BP neural network, optimized by a genetic algorithm, predicts each stable
component. Finally, the component predictions are combined to obtain a highly
accurate traffic prediction. The experimental results show that the model has a
better predictive effect.
1 Introduction


Network traffic data is a kind of time-series data. Analysing traffic data helps
people understand the running status of the network, so that bandwidth allocation,
flow control, routing control, and error control can be performed more reasonably
[1]. Thus, it is an effective way to improve QoS (Quality of Service).

Today’s network traffic is nonlinear, abrupt, and non-stationary. The
first-generation network traffic prediction models, such as the Markov model,
Poisson model, AR model, and ARMA model, cannot describe the non-stationary
characteristics of traffic, and their prediction accuracy is too low for current
network traffic prediction [1].

Later, second-generation traffic prediction models appeared, such as FARIMA, the
gray model, neural network models, wavelet models, and the support vector machine
(SVM), which can capture long-run and non-linear traffic characteristics [2]. Some
shortcomings remain, however: FARIMA cannot describe the non-stationary
characteristics of traffic; neural network models require more training samples and
the algorithm is complex; the gray model applies only when the original sequence
changes slowly, following an exponential law.

To overcome the limitations of a single model and describe the characteristics
of network data more accurately, hybrid models have been proposed, such as the gray
neural network model [3], wavelet combined with a neural network [4], wavelet
combined with a time series model [5], and more complex combinations of wavelet,
neural network, and time series models [6]. In some cases, a complex combination of
models may not reduce the forecast error of the components. Although such a method
improves prediction accuracy, its complexity increases at the same time. This model
is suitable for occasions that require higher prediction accuracy and less time.

In this paper, a traffic prediction model combining the wavelet transform and a
neural network is proposed and optimized by a genetic algorithm. First, the traffic
data are processed by wavelet decomposition, which decomposes the time series into
relatively simple components and smooths the original signal. Second, BP neural
networks are used to predict the high-frequency and low-frequency components
separately. Considering the BP network's shortcomings of slow convergence and local
optima, a genetic algorithm with good global search ability is used to optimize the
weights and thresholds. Finally, the predicted components are reconstructed to
obtain the final prediction result. Experiments show that wavelet decomposition
effectively increases signal stability and that the BP neural network predicts
non-linear changes in traffic well. Meanwhile, introducing the genetic algorithm
accelerates the convergence of the BP neural network and improves the prediction
accuracy.


2 Technical Overview 


2.1 Wavelet Decomposition and Reconstruction 


Wavelet decomposition was proposed by Meyer and Mallat. Time series data are
decomposed into two parts: low-frequency and high-frequency coefficients.

The decomposition of a time series is achieved by the Mallat algorithm, and the
decomposition relationship is as follows:

\[ a_{j+1} = h_0 * a_j, \qquad d_{j+1} = h_1 * a_j, \qquad j = 0, 1, \dots \tag{1} \]

In the formula, $h_0$ represents the low-pass decomposition filter; $h_1$ represents
the high-pass decomposition filter; $*$ represents the convolution operator; $a_j$
represents the low-frequency coefficients; $d_j$ represents the high-frequency
coefficients. When $j = 0$, $a_0$ is the original time series; passing it through
$h_0$ and $h_1$ over several decomposition levels yields the low-frequency and
high-frequency coefficients of the original time series.

The wavelet reconstruction relations are as follows:

\[ A_j = g_0 * a_j, \qquad D_j = g_1 * d_j, \qquad j = 0, 1, \dots \tag{2} \]

In the formula, $g_0$ and $g_1$ represent the low-pass reconstruction filter and the
high-pass reconstruction filter respectively; $A_j$ represents a low-frequency
component; $D_j$ represents a high-frequency component.

Thus, the relation between the original time series $S$ and $A_L$, $D_j$ is as
follows, where $L$ is the decomposition level:

\[ S = A_L + \sum_{j=1}^{L} D_j \tag{3} \]
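
The additive relation between the signal and its components can be checked numerically. The following is a minimal sketch only: it swaps in the Haar wavelet (not the db4 basis or the MATLAB Wavelet Toolbox used in this paper) because its filters are short enough to write by hand, and the function names are illustrative assumptions. It decomposes a signal over L levels and rebuilds each component at full length, so that their sum returns the original series.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(a):
    """One Mallat analysis step with Haar filters: returns (a_next, d_next)."""
    half = len(a) // 2
    approx = [(a[2 * k] + a[2 * k + 1]) / SQRT2 for k in range(half)]
    detail = [(a[2 * k] - a[2 * k + 1]) / SQRT2 for k in range(half)]
    return approx, detail


def haar_upsample(coeffs, is_detail):
    """One synthesis step applied to a single branch (the other branch is zero)."""
    out = []
    for c in coeffs:
        out.append(c / SQRT2)
        out.append(-c / SQRT2 if is_detail else c / SQRT2)
    return out


def haar_components(signal, levels):
    """Return ([D_1, ..., D_L], A_L), each with the same length as the signal."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    a = list(signal)
    detail_coeffs = []
    for _ in range(levels):
        a, d = haar_step(a)
        detail_coeffs.append(d)
    details = []
    for j, d in enumerate(detail_coeffs, start=1):
        comp = haar_upsample(d, is_detail=True)
        for _ in range(j - 1):  # climb back through the coarser levels
            comp = haar_upsample(comp, is_detail=False)
        details.append(comp)
    approx = a
    for _ in range(levels):
        approx = haar_upsample(approx, is_detail=False)
    return details, approx
```

Summing the returned approximation and all detail components point by point should reproduce the input up to floating-point error, which is the property the prediction model relies on when it recombines the per-component forecasts.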
2.2 BP Neural Network


A BP neural network is a hierarchical feed-forward network trained with the
back-propagation algorithm (BP algorithm). It is a supervised learning method whose
basic idea is least squares: the root mean square error (RMSE) and the gradient
descent method are used to correct the network connection weights, so as to
minimize the RMSE between the actual output and the expected output [9]. The
principle is shown in Fig. 1.


[Figure: input layer, hidden layer, and output layer with forward signal flow; the
error against the expected output vector (instructor signal) is fed back by
back-propagation (learning algorithms).]


Fig. 1. BP neural network 


The number of hidden layer neurons $m$ can be determined by the formula
$m = \sqrt{l + n} + a$, where $l$ and $n$ are the numbers of neurons in the input
and output layers and $a$ is an integer in [1, 10].
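
As a worked example, assuming the formula reads $m = \sqrt{l + n} + a$ and taking the $l = 24$ input neurons and $n = 12$ output neurons used in Sect. 3.2, the range of candidate hidden-layer sizes is:

\[ m = \sqrt{24 + 12} + a = 6 + a, \quad a \in \{1, \dots, 10\} \;\Rightarrow\; m \in \{7, 8, \dots, 16\} \]

The hidden-layer sizes of 10, 12, and 16 reported in Sect. 4.4 fall within this range.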


2.3 Genetic Algorithm 


The genetic algorithm (GA) treats possible solutions in the problem space as
chromosome individuals in a population and encodes each individual as a symbol
string. A fitness value is calculated for each individual with the fitness
function, and the population evolves from generation to generation. By simulating
biological evolution through selection, crossover, mutation, and so on, the optimal
solution is obtained [5].

This paper focuses on weight optimization, which includes the following steps:
1. population initialization


Individuals are encoded with real numbers, and the encoding length is the sum
of the number of network weights and the number of thresholds:


\[ S = l \times m + m \times n + m + n \tag{4} \]


In the formula, $S$ is the code length, and $l$, $m$, $n$ are the numbers of neurons
in the input layer, hidden layer, and output layer respectively.


2. fitness function


\[ F = k \left( \sum_{i=1}^{n} \operatorname{abs}(y_i - o_i) \right) \tag{5} \]

In the formula, $n$ is the number of nodes in the network output layer, $k$ is the
coefficient used to adjust the range of fitness values, $y_i$ is the expected output
of the $i$th node of the BP neural network, and $o_i$ is the predicted output of the
$i$th node.


3. selection


We use roulette-wheel selection, a selection strategy based on individual fitness.


\[ p_i = \frac{k / F_i}{\sum_{j=1}^{n} k / F_j} \tag{6} \]


In the formula, $p_i$ is the selection probability of individual $i$, $k$ is a
coefficient, $n$ is the number of individuals, and $F_i$ is the fitness value of
individual $i$.


4. crossover


This paper uses real-valued crossover. The crossover of the $k$th chromosome $a_k$
and the $l$th chromosome $a_l$ at position $j$ is as follows:


\[ a_{kj} = a_{kj}(1 - b) + a_{lj} b, \qquad a_{lj} = a_{lj}(1 - b) + a_{kj} b \tag{7} \]


In the formula, $b$ is a random number in [0, 1].


5. mutation


\[ a_{ij} = \begin{cases} a_{ij} + (a_{ij} - a_{\max}) \times f(g), & r \ge 0.5 \\ a_{ij} + (a_{\min} - a_{ij}) \times f(g), & r < 0.5 \end{cases} \tag{8} \]


\[ f(g) = r_2 (1 - g / G_{\max}) \]


In the formula, $a_{ij}$ is the $j$th gene of the $i$th individual, $a_{\max}$ is the
upper bound of the gene $a_{ij}$, $a_{\min}$ is the lower bound of the gene $a_{ij}$,
$r_2$ is a random number, $g$ is the current number of iterations, $G_{\max}$ is the
maximum evolution number, and $r$ is a random number in [0, 1].
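
The five steps above can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions, not the GAOT implementation used in this paper: the function names are hypothetical, the selection probability is assumed to be proportional to k / F_i (so a smaller error is selected more often), the mutation is assumed to use f(g) = r2 (1 - g / Gmax), and fitness values are assumed to be strictly positive.

```python
import random


def chromosome_length(l, m, n):
    """Code length S = l*m + m*n + m + n (all weights plus all thresholds)."""
    return l * m + m * n + m + n


def fitness(expected, predicted, k=1.0):
    """F = k * sum(|y_i - o_i|); a smaller value means a better individual."""
    return k * sum(abs(y - o) for y, o in zip(expected, predicted))


def selection_probabilities(fitness_values, k=1.0):
    """Roulette-wheel probabilities, assumed proportional to k / F_i."""
    inverse = [k / f for f in fitness_values]
    total = sum(inverse)
    return [v / total for v in inverse]


def roulette_select(population, probabilities, rng):
    """Draw one individual according to the selection probabilities."""
    return rng.choices(population, weights=probabilities, k=1)[0]


def crossover(a_k, a_l, j, rng):
    """Real-valued crossover of chromosomes a_k and a_l at position j."""
    b = rng.random()
    new_k, new_l = list(a_k), list(a_l)
    new_k[j] = a_k[j] * (1 - b) + a_l[j] * b
    new_l[j] = a_l[j] * (1 - b) + a_k[j] * b
    return new_k, new_l


def mutate(a, j, a_min, a_max, g, g_max, rng):
    """Mutate gene j; the step shrinks to zero as g approaches g_max."""
    r, r2 = rng.random(), rng.random()
    f_g = r2 * (1 - g / g_max)  # assumed form of f(g)
    new = list(a)
    if r >= 0.5:
        new[j] = a[j] + (a[j] - a_max) * f_g
    else:
        new[j] = a[j] + (a_min - a[j]) * f_g
    return new
```

For a 24 × 12 × 12 network, `chromosome_length(24, 12, 12)` gives 456 real-valued genes per individual. Note that the crossover keeps the sum of the two genes at position j unchanged, which is a quick sanity check on the operator.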


3 Design of Traffic Prediction Model Based on Wavelet 
Decomposition and Genetic Algorithm 


3.1 Model Ideas 


Network traffic data is self-similar (monofractal) on large time scales, while
multifractal behaviour appears on smaller scales [10]. The model in this paper first
processes the traffic data by wavelet decomposition and then predicts the components
with BP neural networks, as shown in Fig. 2. A genetic algorithm is introduced to
optimize the weights and thresholds of the neural networks [11].
Fig. 2. Schematic diagram of network traffic prediction model


The specific process of optimizing the BP neural network by the genetic algorithm
is shown in Fig. 3:

[Figure: flowchart. Determine the network structure; initialize the weight and
threshold length of the BP neural network; encode the initial values by GA; input
data; calculate the fitness value (BP neural network training error); selection,
crossover and mutation; calculate the fitness value; if the conditions are
satisfied, get the optimal weights and thresholds; calculate the error and update
the weights and thresholds; prediction results.]





Fig. 3. Flow of GA optimizing BP neural network 


3.2 Model Steps 


Step 1: Wavelet decomposition. The wavelet transform is applied to the input traffic
data using formula (1), the Mallat algorithm. Using the Wavelet Toolbox in
MATLAB with decomposition level $L$, the low-frequency and high-frequency
components with stationary features at time $k$ are obtained as
$\{D_1(k), D_2(k), \dots, D_L(k), A_L(k)\}$. The original signal is
$S = D_1 + D_2 + D_3 + \dots + D_L + A_L$;

Step 2: Initialize the BP1, BP2, BP3, ..., BP(L + 1) neural networks. Each has three
layers: an input layer with 24 neurons, a single hidden layer, and an output layer
with 12 neurons;

Step 3: Preprocess $A_L(k), D_1(k), D_2(k), \dots, D_L(k)$, and construct the
training samples of each neural network;

Step 4: Train neural network BP1, optimized by the genetic algorithm. $A_L(k)$ is
the input of BP1, and $A_L(k + T)$ at time $k + T$ is the expected value of the
network. The training process is shown in Fig. 2;

Step 5: Train BP2, BP3, ..., BP(L + 1), optimized by the genetic algorithm. $D_1(k)$
is the input of BP2 and $D_1(k + T)$ is its expected value at time $k + T$; $D_2(k)$
is the input of BP3 and $D_2(k + T)$ is its expected value at time $k + T$; and so
on;

Step 6: Input the test traffic data, combine the $L + 1$ predicted components
(high-frequency and low-frequency) obtained from the neural networks, and compare
the result with the actual traffic. The component prediction errors can be both
positive and negative, so part of the error is offset after combination.


4 Network Traffic Prediction Model Implementation 


4.1 Wavelet Decomposition 


In this paper, the network traffic data come from the CERNET backbone network center
in Northeast China and are provided by the China Education and Research Network. We
use 30 days of Shenyang-Changchun data (from April 30, 2016 to May 29, 2016), as
shown in Fig. 4. Network traffic is measured in GB, and the acquisition time
granularity is 2 h, that is, 12 points a day and 360 points in total over 30 days.





[Figure: real traffic plotted against data points (horizontal axis from 0 to 400).]
Fig. 4. Graph of real traffic data 


For the traffic data in Fig. 4, the wavelet transform is performed using formulas
(1) and (2). With db4 as the wavelet basis and decomposition scale $L = 5$, the
high-frequency and low-frequency components are
$\{D_1(k), D_2(k), D_3(k), D_4(k), D_5(k), A_5(k)\}$, and the original signal is
$S = D_1 + D_2 + D_3 + D_4 + D_5 + A_5$. The results are shown in Fig. 5.

Fig. 5. Data signal after wavelet transform


4.2 Preprocessing of Traffic Data Sequence 


To speed up the convergence of the neural network and improve the prediction
accuracy, the network traffic data obtained after wavelet decomposition are
normalized [12]. The formula is:


\[ X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{9} \]


The normalized data lie within the range [0, 1]; $X_{\max}$ and $X_{\min}$ are the
maximum and minimum values respectively in the network traffic data set, $X'$ is the
normalized value, and $X$ is the original value of the variable.
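
The min-max normalization, together with the inverse mapping needed to turn network outputs back into traffic values, can be sketched as follows. The function names are illustrative assumptions; the paper itself performs this step in MATLAB.

```python
def min_max_normalize(values):
    """Scale values into [0, 1]; also return the bounds needed to undo the scaling."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant series")
    scaled = [(x - x_min) / (x_max - x_min) for x in values]
    return scaled, x_min, x_max


def denormalize(scaled, x_min, x_max):
    """Map normalized values back to the original scale."""
    return [s * (x_max - x_min) + x_min for s in scaled]
```

Keeping `x_min` and `x_max` from the training data is a design choice of this sketch: the same bounds are then reused for the predicted values, so that the predictions are expressed in the original units before the components are recombined.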


4.3 The Sample Construction of Neural Network 


$X_{i,j}$ represents the network traffic on the $i$th day at time $j$, and $X_i$
represents the network traffic on the $i$th day. The network traffic data is
$X = \{(X_{1,1}, X_{1,2}, \dots, X_{1,n}), (X_{2,1}, X_{2,2}, \dots, X_{2,n}), \dots, (X_{m,1}, X_{m,2}, \dots, X_{m,n})\}$.
The $k$ learning samples are
$P = ((X_1, X_2, \dots, X_j); (X_2, X_3, \dots, X_{j+1}); \dots; (X_k, X_{k+1}, \dots, X_{k+j-1}))$,
and the corresponding $k$ teacher samples are $T = (X_{j+1}; X_{j+2}; \dots; X_{j+k})$.
The purpose of learning is to correct the weights using the error between the
neural network outputs for the $k$ learning samples $P_1, P_2, \dots, P_k$ and the
corresponding teacher samples $T_1, T_2, \dots, T_k$, so that the output approaches
the expected teacher sample. $X_{i,j}$ is a seasonal time series with $j$ as the
cycle, $T_i$ is the teacher sample, and the relationship between the two is
$T_i = X_{i+j}$.

In this paper, $m = 30$, $n = 12$ and $j = 2$: the input layer of the network uses
24 neurons, which map the traffic of 2 d, and the output layer uses 12 neurons,
which map the traffic of the following 1 d.
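
The sample construction amounts to a sliding window over days. The sketch below assumes a window of two days (24 inputs) and a one-day target (12 outputs), as in this section; the function name and the list-of-lists layout are illustrative assumptions.

```python
def build_samples(days, window=2):
    """Build (inputs, targets) from a list of per-day traffic vectors.

    Each input concatenates `window` consecutive days; the target is the
    day that immediately follows them.
    """
    inputs, targets = [], []
    for i in range(len(days) - window):
        flat = [x for day in days[i:i + window] for x in day]
        inputs.append(flat)
        targets.append(list(days[i + window]))
    return inputs, targets
```

With 20 training days of 12 points each, this yields 18 samples: the first pairs days 1 and 2 with day 3 as the target, and the last pairs days 18 and 19 with day 20.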


4.4 Training Neural Networks 


After several debugging experiments, the BP1 neural network is set to a
24 × 12 × 12 three-layer structure, and the excitation functions of the input layer,
hidden layer, and output layer are all purelin; BP2 is a 24 × 16 × 12 three-layer
structure with excitation functions tansig, tansig, purelin; BP3, BP4, and BP5 are
24 × 10 × 12 three-layer structures with excitation functions tansig, tansig,
purelin; BP6 is a 24 × 16 × 12 three-layer structure whose excitation and training
functions are the same as those of BP3. The training functions of these neural
networks are trainlm, the
maximum training times are 2000, the neural network target error is le” the learning 
rate is set to 0.3, and the rest is the default value. 

In this paper, GAOT is used to program the genetic algorithm. The flow of the
algorithm is shown in Fig. 3, and the objective function is the fitness function of
formula (5). The population size is set to 50, the number of generations is 100,
real-number coding is used, the mutation probability is 0.09, and the remaining
parameters take their default values.


5 Traffic Prediction and Analysis of Results 


The data of the first 20 d (days) were selected for training, and the data of the
last 10 d (the last 120 points) were used for testing. The traffic time series of
30 d is:


\[ X = \{ t_1 = (X_{1,1}, \dots, X_{1,12}),\; t_2 = (X_{2,1}, \dots, X_{2,12}),\; \dots,\; t_{30} = (X_{30,1}, \dots, X_{30,12}) \} \]


The learning samples are $P = \{(t_1, t_2); (t_2, t_3); \dots; (t_{18}, t_{19})\}$,
and the corresponding teacher samples are $T = \{t_3, t_4, \dots, t_{20}\}$. After
learning, the network is used for prediction: $t_{19}$ (network traffic on day 19)
and $t_{20}$ (network traffic on day 20) are used to predict the traffic on day 21,
$t_{20}$ and $t_{21}$ are used to predict the traffic on day 22, and so on.

Figures 6 and 7 show the results of short-term prediction of the last 120 data
points 1 day in advance. Figure 6 shows the comparison for the new model without
genetic algorithm (GA) optimization, and Fig. 7 shows the comparison for the new
model after GA optimization. The abscissa is the detection point and the ordinate is
network traffic (GB).

Fig. 7. GA optimized fitting figure of 1d short-term prediction


The prediction results in Fig. 6 show that the new model has a higher prediction
accuracy and a better degree of fit. Figure 7 shows that the new model is
satisfactory. To clarify the performance of the new model, the model performance
parameters with and without GA are listed in Table 1.


Table 1. Parameters of model performance


Model | MSE | MAE | RMSE | R-square
New model without GA (1d) | 3.3507 | 1.2670 | 1.8305 | 0.9027
New model with GA (1d) | 2.6736 | 1.2583 | 1.6351 | 0.9680
MSE is the mean square error, MAE is the mean absolute error, RMSE is the root mean
square error, and R-square is the coefficient of determination.

From Table 1, we can see that, compared with the new model without optimization,
the MSE of the GA-optimized model is reduced by 20.21%, so the prediction accuracy
is improved. Meanwhile, the coefficient of determination is increased by 7.23%,
which improves the degree of fit. This indicates that using the genetic algorithm
to optimize the new model is feasible and effective.
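
For reference, the four performance parameters and the relative changes quoted from Table 1 can be computed as in the following sketch. The function names are illustrative assumptions, and the definitions used are the standard ones for MSE, MAE, RMSE, and the coefficient of determination.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, MAE, RMSE and R-square for two equally long sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean = sum(actual) / n
    ss_tot = sum((a - mean) ** 2 for a in actual)
    r_square = 1.0 - sum(e * e for e in errors) / ss_tot
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse), "R-square": r_square}


def relative_change(before, after):
    """Relative change in percent; negative values indicate a reduction."""
    return (after - before) / before * 100.0


# Values taken from Table 1 (1 d prediction, without and with GA).
mse_change = relative_change(3.3507, 2.6736)
r2_change = relative_change(0.9027, 0.9680)
```

Rounded to two decimals, `mse_change` and `r2_change` correspond to the 20.21% reduction in MSE and the 7.23% increase in R-square stated above; the RMSE column of the table is likewise the square root of the MSE column.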

Although the genetic algorithm needs time to run, it speeds up the convergence of
the neural network and improves the prediction accuracy. The training times and
running times are compared in Table 2.


Table 2. Comparison of training times and running time


The name of New model with GA optimization 
nel nee Running time(s) 
BP1 35 20.56

BP2 16.97 

BP3 10 10.22 

BP4 foo 882 8e an 

BP5 10 12.71

BP6 a2 737 |6 jos 


The above results prove the validity of using the genetic algorithm. The new model
can also predict the traffic of $t_{21}$ and $t_{22}$ from $t_{19}$ and $t_{20}$,
and so on. The prediction results are shown in Fig. 8.


[Figure: predicted traffic with GA optimization compared with real traffic.]





Fig. 8. GA optimized fitting figure of 2d short-term prediction 


The new model with GA optimization is also used for short-term forecasting 3 days
in advance, as shown in Fig. 9.

Fig. 9. GA optimized fitting figure of 3d short-term prediction


The performance parameters of the new model with GA optimization for predictions
2 d and 3 d in advance are listed in Table 3.


Table 3. Parameters of model performance


Model | MSE | MAE | RMSE | R-square
New model with GA (2d) | 3.8539 | 1.5591 | 1.9631 | 0.8476
New model with GA (3d) | 4.7562 | 1.6605 | 2.1809 | 0.8119


From Tables 2 and 3, we can see that the new model has higher accuracy and a better
degree of fit in short-term prediction. Although the 3 d prediction curve in Fig. 9
does not fit as well as those in Figs. 7 and 8, it still roughly describes the
trend of the traffic, with little adverse effect.


6 Conclusion 


The results of the MATLAB simulation show that the network traffic prediction model
based on wavelet decomposition and a genetic algorithm has good accuracy and is an
effective prediction model, which is mainly reflected in the following:


1. After wavelet decomposition, the original signal becomes simpler and more stable,
which provides a stable basis for BP neural network prediction;

2. Traditional time series analysis methods have difficulty ensuring high accuracy.
For non-linear traffic, BP neural network prediction is better;

3. The genetic algorithm can mitigate the defects of BP neural networks, namely that
they easily fall into local optima and converge slowly.

However, the model in this paper is rather complicated. Each high-frequency and
low-frequency component obtained by wavelet decomposition is predicted by a neural
network, so the computational cost of prediction is large, which affects the
efficiency of the model. Therefore, further improving the efficiency of the model
while maintaining high accuracy is the future research direction of this work.
Abstract. This paper examines the Weighted Least-Connection (WLC) algorithm, a
default load balancing scheduling algorithm, and points out its deficiencies by
setting up a Linux Virtual Server (LVS) under CentOS 7. We then improve the factors
that influence the weight. In the experiment, the throughput of the system
increases and the response time shortens. Consequently, the improvement optimizes
the performance and improves the stability of the whole system.


Keywords: Load balance · Scheduling · Cluster · WLC algorithm · Throughput rate


1 Preface 


As information on the Internet grows at a geometric rate, the parallel architecture
techniques behind distributed web cluster services are also maturing. Most service
vendors build cluster systems on Linux, which can effectively handle many
concurrent requests, balance load, and fix faults in time [1]. This paper concerns
the latest released version of CentOS 7, which manages and assigns tasks using the
default load balancing scheduler of LVS and the ip_vs module.

The earliest management applications developed abroad, such as Zeus [2], became
obsolete for handling unbalanced load among multiple servers because deploying them
on each machine was cumbersome. Afterwards, load balancing products based on
software and hardware appeared, such as Barracuda and Load Directors, which
addressed load balancing early and have maintained a leading position.

LVS is a free software project started and developed by Wen-song Zhang, a doctor of
the National University of Defense Technology. It provides more than ten load
scheduling algorithms, built mainly on IP load balancing techniques. LVS therefore
ranks first among load management systems for clusters and is well known at home
and abroad for its scalability, reliability, and manageability [3, 4].

In studying Weighted Least-Connection scheduling, we found that the ratio between
the number of client connections and the weight of each server cannot measure load
accurately. Therefore, we introduce extra factors such as CPU, memory, and hard
disk. Most web services also involve network I/O resources, which reflect part of
the real load of each server. Considering all these factors together, the load
balancer can divide upcoming tasks rationally and enhance the performance of the
whole system. The next section introduces the traditional WLC algorithm and the
improved version.


2 Load Balancing Scheduling Algorithm 


LVS can be implemented under the IP load balancing technique in three ways: VS/NAT,
VS/TUN, and VS/DR. VS/DR prevents the load balancer from becoming the system
bottleneck, and it can sometimes also reduce the complexity of configuration. The
VS/DR architecture of this experiment is shown in Fig. 1.

Fig. 1. VS/DR architecture


2.1 Traditional Weighted Least-Connection Scheduling Algorithm 


More than ten traditional load balancing scheduling algorithms are built into
CentOS 7, and Weighted Least-Connection scheduling is the default [5]. It is a
superset of Least-Connection scheduling. Each server Si (i = 0, 1, ..., n) has a
weight W(Si) that represents its performance (default 1) and a current number of
connections C(Si); Csum denotes ΣC(Si). A new task is assigned to the server Sm
that satisfies the condition in formula (1).


\[ \frac{C(S_m)/C_{sum}}{W(S_m)} = \min_i \left\{ \frac{C(S_i)/C_{sum}}{W(S_i)} \right\} \tag{1} \]


Since Csum is the same on both sides and multiplication is more efficient than
division, the comparison can be simplified to formula (2) [6], which holds exactly
when Si has a smaller ratio of connections to weight than Sm.


\[ C(S_m) \times W(S_i) > C(S_i) \times W(S_m) \tag{2} \]

The aim of WLC is to find the server Si in the server pool with the minimal value
of C(Si)/W(Si) and then assign the new task to it. However, as discussed above,
relying only on the connections and the weight is not reliable, and an experienced
administrator must adjust the weights manually during the whole procedure.
Therefore, we introduce and adopt the improved WLC algorithm in the next section.
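
The traditional selection rule can be sketched as a single pass over the server pool that uses only the cross-multiplied comparison of formula (2). This is an illustrative sketch, not the ip_vs kernel code: the `Server` class and `wlc_pick` function are hypothetical names, and skipping servers with a non-positive weight is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int        # W(Si), static performance weight
    connections: int   # C(Si), current number of connections


def wlc_pick(servers):
    """Return the server with the smallest C(Si)/W(Si), without dividing."""
    best = None
    for s in servers:
        if s.weight <= 0:
            continue  # assumption: a non-positive weight means "do not schedule"
        # Formula (2): switch to s when C(best) * W(s) > C(s) * W(best).
        if best is None or best.connections * s.weight > s.connections * best.weight:
            best = s
    return best
```

For example, with servers A (weight 1, 10 connections), B (weight 3, 12 connections) and C (weight 2, 9 connections), the ratios are 10, 4 and 4.5, so B receives the next task.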


2.2 Improved WLC Algorithm 


The improved version introduces server parameters such as CPU, memory, and disk to
estimate the real load and update the weight automatically. L(Si) represents the
load occupancy rate, P(Si) represents the hardware performance, and F(Si)
represents the network performance. P(Si) and F(Si) should be considered together
to describe the overall performance. P(Si), which describes the hardware, usually
changes little with increasing pressure, so we treat it as a static value and send
it to the load balancer only once. F(Si), however, changes dynamically with the
throughput rate and TTLB, so it is sent back together with L(Si). The formulas are
as follows [7].


\[ L(S_i) = \lambda_1 \times L_{cpu}(S_i) + \lambda_2 \times L_{mem}(S_i) + \lambda_3 \times L_{disk}(S_i) \tag{3} \]
\[ P(S_i) = \mu_1 \times P_{cpu}(S_i) + \mu_2 \times P_{mem}(S_i) + \mu_3 \times P_{disk}(S_i) \tag{4} \]
\[ F(S_i) = \gamma_1 \times F_{ttlb}(S_i) + \gamma_2 \times F_{tr}(S_i) \tag{5} \]


The coefficients λ, μ, γ are the proportions of each item; each lies in (0, 1) and
each group sums to 1. Assigning them specific values reduces complexity and makes
it easier to control variables. The load weight of a server can then be estimated
by formula (6): W(Si) is proportional to L(Si) and inversely proportional to F(Si).
The load balancer collects L(Si) and F(Si) of each server when a new time slice
ΔT begins [8].


\[ W(S_i) = \frac{L(S_i)}{P(S_i) \times F(S_i)} \tag{6} \]
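
The weight estimation can be sketched as three weighted sums followed by one division. This sketch rests on a stated assumption: formula (6) is taken to be W(Si) = L(Si) / (P(Si) × F(Si)), which matches the proportionality described in the text but is not fully legible in the source. The function names are hypothetical, and the default coefficients are the values assigned later in Sect. 2.3.

```python
def weighted_sum(coefficients, values):
    """Combine measurements with coefficients that must sum to 1."""
    if abs(sum(coefficients) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    return sum(c * v for c, v in zip(coefficients, values))


def load_weight(load, performance, network,
                lam=(0.4, 0.3, 0.3), mu=(0.5, 0.4, 0.1), gamma=(0.4, 0.6)):
    """Estimate W(Si) from (cpu, mem, disk) load, (cpu, mem, disk) performance
    scores and (ttlb, throughput) network figures.

    Assumed form of formula (6): W = L / (P * F).
    """
    l_value = weighted_sum(lam, load)          # formula (3)
    p_value = weighted_sum(mu, performance)    # formula (4)
    f_value = weighted_sum(gamma, network)     # formula (5)
    return l_value / (p_value * f_value)
```

Under this assumed form, a server with higher load occupancy gets a larger W(Si), and a server with better hardware or network figures gets a smaller one, so the balancer prefers the smallest W(Si).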


In order to improve fault tolerance and avoid a server becoming overloaded during
ΔT when a new task arrives, we put forward the concept of load redundancy to
indicate the extra load that a server can hold [9, 10]. Load redundancy represents
the maximum extra capacity within a time slice Δt, as in formula (7) [5].

nisi) = Z re x At S 


Δt is expected to lie in the range [5, 10] seconds. On the one hand, a small Δt
causes frequent extra computation; on the other hand, a large Δt cannot reflect the
redundancy in time. The load balancer also needs to set an initial minimal
redundancy threshold Rmin and maintain a binary sort tree; only servers whose R(Si)
is equal to or greater than the threshold are added to that data structure. Every
node of the tree has a chance to receive a new task, but we preferentially choose
the largest value of R(Si), which means choosing the smallest W(Si), i.e. the
server under less load pressure [5, 12].

P(Si) and F(Si) can be estimated by software, as described in the next section.
L(Si) has three factors that change over time; intuitive ways to obtain their
values are given below.


(1) The real-time CPU utilization rate is computed from the difference between two
close samples of the file “/proc/stat” on a Linux server, whose content lists
several kinds of time slices measured in jiffies. Formula (8) gives the total CPU
time.


\[ totalTime = user + nice + system + idle + iowait + irq + softirq + steal + guest \tag{8} \]


We record totalTime as T1 and idle as I1. After a very short period, we read
totalTime and idle again and record them as T2 and I2 respectively. Finally,
formula (9) gives the CPU utilization rate.


\[ L_{cpu}(S_i) = \frac{(T_2 - I_2) - (T_1 - I_1)}{T_2 - T_1} \tag{9} \]
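
The two-sample calculation can be sketched as follows. The functions take the text of the aggregate `cpu` line as input, so the sketch can be tried without a Linux host; reading `/proc/stat` twice and passing the first line of each read is left to the caller. The function names are hypothetical, and the utilization is assumed to be the non-idle share of the elapsed jiffies, which is equivalent to formula (9).

```python
def parse_cpu_line(line):
    """Return (total, idle) in jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal, guest
    values = [int(v) for v in fields[1:10]]
    return sum(values), values[3]


def cpu_utilization(first_line, second_line):
    """Share of non-idle time between two samples, in the range [0, 1]."""
    t1, i1 = parse_cpu_line(first_line)
    t2, i2 = parse_cpu_line(second_line)
    if t2 == t1:
        raise ValueError("samples are identical; wait longer between reads")
    return ((t2 - i2) - (t1 - i1)) / (t2 - t1)
```

For instance, if the total grows from 1000 to 1200 jiffies while idle grows from 800 to 900, then 100 of the 200 elapsed jiffies were busy and the utilization is 0.5.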


(2) The memory information of CentOS 7 is stored in “/proc/meminfo”. Formula (10)
gives the free memory of Linux, and formula (11) gives the memory occupancy rate,
where MemTotal is the total memory size.


\[ Mfree = MemFree + Buffers + Cached \tag{10} \]


\[ L_{mem}(S_i) = \frac{MemTotal - Mfree}{MemTotal} \tag{11} \]
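
The memory occupancy rate can be sketched in the same style. The function takes the text of `/proc/meminfo` as a string, so the caller decides how to read the file; the function name is hypothetical, and the occupancy is assumed to be (MemTotal − Mfree) / MemTotal with Mfree as in formula (10).

```python
def memory_occupancy(meminfo_text):
    """Compute the memory occupancy rate from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    m_free = fields["MemFree"] + fields["Buffers"] + fields["Cached"]
    return (fields["MemTotal"] - m_free) / fields["MemTotal"]
```

On a live server the input would be obtained with something like `open("/proc/meminfo").read()`; buffers and page cache are counted as free here because the kernel can reclaim them, which is the reasoning behind formula (10).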


(3) The “iotop” tool can be installed to check the overall I/O occupancy rate under
Linux. The command “iotop -o” shows all processes/threads that are performing I/O
operations. We take the column named “IO>” and assign its sum over the n listed
threads to Ldisk(Si).


\[ L_{disk}(S_i) = \sum_{tid=1}^{n} IO(tid) \tag{12} \]


The load balancer and the servers communicate over TCP sockets [11]. Each server,
acting as a TCP client, sends L(Si), R(Si), and F(Si) regularly but P(Si) only
once. The load balancer, acting as a TCP server, receives the values from each
server and calculates the new W(Si) and R(Si). If R(Si) ≥ Rmin, the server is put
into the tree; if R(Si) < Rmin, it is removed from the tree [5]. Traversing the
tree in order ensures that the smaller node we pick can handle a new task more
easily.
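
The threshold rule can be sketched with a small container that admits a server only while its redundancy is at least Rmin and hands out the server with the largest redundancy. Note the substitution: this sketch keeps a sorted list maintained with the standard `bisect` module instead of the binary sort tree described above, and it treats R(Si) as a number supplied by the caller, since the redundancy formula itself is not reproduced here. The class and method names are hypothetical.

```python
import bisect


class RedundancyPool:
    """Servers with R(Si) >= r_min, kept sorted by redundancy."""

    def __init__(self, r_min):
        self.r_min = r_min
        self._entries = []   # sorted list of (redundancy, server name)
        self._current = {}   # server name -> redundancy stored in _entries

    def update(self, name, redundancy):
        """Record a new R(Si) report; insert, move or remove the server."""
        old = self._current.pop(name, None)
        if old is not None:
            index = bisect.bisect_left(self._entries, (old, name))
            del self._entries[index]
        if redundancy >= self.r_min:
            bisect.insort(self._entries, (redundancy, name))
            self._current[name] = redundancy

    def pick(self):
        """Return the server with the largest redundancy, or None if empty."""
        return self._entries[-1][1] if self._entries else None
```

A balancer would call `update` whenever a server reports new values and `pick` whenever a task arrives; when no server meets the threshold, `pick` returns `None` and the caller has to decide on a fallback, which the text above does not specify.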
2.3 Result of Experiment


We use several tools, such as “AIDA64”, “CPU-Z”, and “Web Application Stress
Tool”, to estimate P(Si) as scores and to estimate F(Si), as in Figs. 2 and 3 [12].
The experiment imitates the operations of login, add, refresh, and quit in sequence
on a bookmarking website that we built earlier in PHP.
Fig. 2. AIDA64 Fig. 3. Web application stress tool
Fig. 4. TTLB average response time Fig. 5. Bytes received rate


Firstly, we assign λ1 = 0.4, λ2 = 0.3, λ3 = 0.3, μ1 = 0.5, μ2 = 0.4, μ3 = 0.1,
γ1 = 0.4, γ2 = 0.6 and Δt = 10 s. Then we collect the average response time and
throughput with an increasing number of requests. Figures 4 and 5 show the trend
lines.

In Fig. 4, the improved algorithm is superior to the traditional one once the
number of requests exceeds 271. Frequent calculation and update operations need
extra time in the early period, but the more rational task assignment strategy then
shortens the response time. In Fig. 5, the throughput first increases and later
decreases, and the improved algorithm is superior to the traditional one after
about 267 requests. The throughput reaches its maximum at about 410 requests
because the system gradually saturates as requests increase.

Secondly, we examine the factors λ and γ in L(Si) and F(Si) with the same 800
requests, as shown in Tables 1 and 2 below.








Table 1. λ factor Table 2. γ factor
y1 F(Si) 
SCF) =e 
1894.36 | 2017.12 | 2398.40 | 2658.32 0.5 963.57 
673.90 





| 0.3 | 2271.24 | 2475.67 | 2825.21 | = 
| 0.6 | 3093.45 | 331704 | | 
Coe | 3772) — | a | a 





Table 1 shows how TTLB changes as λ is given different proportions. λ1 has more
influence than λ2 and (1 - λ1 - λ2), so CPU may have the highest priority. Table 2
shows a similar pattern for γ. Setting appropriate factors can improve the
situation [13]. Boundary values do not generate reliable results.

Finally, according to the analyses above, the improved algorithm raises load
balancing performance by about 15-18% in a cluster environment, which benefits the
stability and efficiency of the whole system.


3 Conclusion 


This paper introduces an approach that enhances the performance of a cluster system
and offers better concurrent service by improving Weighted Least-Connection
scheduling. Estimating weights reasonably and filtering statistical data before
delivering tasks speeds up responses smoothly. The improved algorithm can
contribute to scientific experiments and businesses such as neural networks, big
data analysis, e-commerce, and virtual host services. However, many problems remain
to be solved [1, 12], such as fully understanding more influencing factors and
improving speed. We will overcome these limitations and strengthen the algorithm as
technology develops.
Abstract. A good base station deployment plan can help network operators save
cost and significantly increase total revenue while ensuring network quality.
In the past, however, base station locations were often planned manually, based
on the engineer's experience, which is inefficient and error-prone. In this
paper, a new method based on a genetic algorithm is proposed to optimize base
station location. A CAD system based on Google Earth and ACIS is designed to
provide data for the genetic algorithm and to display the base station locations
on the reconstructed terrain. Unlike the traditional method, which uses only
two-dimensional coordinates, this system takes three-dimensional geographic
coordinates as the input of the algorithm, so it can display the base station
locations better and take height into consideration. The proposed method is
based on a mathematical model of base station location. The genetic algorithm is
used to solve this model, which effectively reduces the error rate of base
station location.
1 Introduction


An excellent base station location plan helps operators obtain the best results in the
increasingly competitive mobile communications market [1]. In early planning and design,
base station sites were chosen mainly by network optimization engineers on the basis of
experience and field measurements [3, 4]. This method is not very scientific: over-reliance
on subjective factors often leads to results far from the actual optimum configuration. The
automatic base station planning CAD system based on a genetic algorithm proposed in this
paper runs on the Windows platform.

The base station planning problem is solved as follows. First, Microsoft Visual Studio,
Xtreme Toolkit Pro, and ACIS/HOOPS are used to build the development environment. Second,
Google Earth is imported into the development environment using COM technology, and the
Google Earth coordinates, which include longitude, latitude, and elevation, are accessed
and transformed. Third, ACIS/HOOPS technology is used to reconstruct the three-dimensional
terrain from the transformed coordinates. At the same time, a multi-objective mathematical
model of base station planning is established from the geographic information, and a
primary base station scheme is produced using the genetic algorithm. Finally, ACIS/HOOPS
modeling techniques are used to display the primary scheme on the reconstructed terrain.
The structure of the process is shown in Fig. 1:
Fig. 1. Flow chart of base station location optimization


2 Terrain Reconstruction by Google Earth and ACIS Technology 


ACIS is a graphics-system development platform built on C++. Developers can use its classes
and functions to construct three-dimensional software systems for end users. Google Earth
provides a set of COM components that can be embedded into the Visual Studio development
platform through COM technology [2]. Through secondary development of Google Earth, the
elevation, latitude, and longitude of the actual terrain can be obtained, as shown in
Fig. 2. These are converted into three-dimensional screen coordinates of the CAD system, as
shown in Fig. 3, which simplifies the data and facilitates calculation when the transformed
screen coordinates are used as the input of the algorithm. After the optimization algorithm
has run, the optimal base station positions it outputs can be converted back to actual
elevation, latitude, and longitude coordinates.


Fig. 2. The extracted actual coordinates Fig. 3. The transformed screen coordinates
The transformed screen coordinates can be used to reconstruct the terrain with ACIS
technology through the api_face_spl_apprx, api_face_plane and api_loft_faces interfaces.
The actual terrain is shown in Fig. 4 and the reconstructed terrain in Fig. 5. The
reconstructed terrain is used to show the base station location plan in three dimensions.





Fig. 4. The actual terrain Fig. 5. The reconstructed terrain 


3 The Mathematical Model of Base Station Location 


3.1 Coverage Rate Statement and Formulation 


The coverage rate is the number of covered demand points divided by the total number of
demand points on the terrain [4]. To avoid counting a demand point more than once, each
demand point is assigned only to the coverage area of the base station nearest to it. To
formulate the coverage target, we define the following: $U$ is the set of all demand
points; $j$ is a demand point, $j \in U$; $T$ is the set of all base stations; $i$ is a
base station, $i \in T$; $d_{ij}$ is the distance between base station $i$ and demand point
$j$; and $r_i$ is the coverage radius of base station $i$. The coverage area of base
station $i$, denoted by $C_i$, is the collection of demand points that lie within the
coverage radius of $i$ and are closer to $i$ than to any other base station. Formally,
$C_i$ is given by:

$$C_i = \{\, j \in U \mid d_{ij} < r_i \wedge d_{ij} < d_{kj},\ \forall k \in T,\ k \neq i \,\} \qquad (1)$$

The ultimate goal of coverage optimization is to achieve full coverage of all user
points with the minimum number of base stations.
In the actual experiment, we define $f_{cr}$ as the base station coverage objective
function. The total number of demand points covered by all base stations is
$\sum_{i \in T} |C_i|$, and the total number of demand points deployed on the terrain is
$|U|$. When $f_{cr} = 1$, all demand points on the terrain are covered, although
$f_{cr} = 0.95$ is quite acceptable. Even though $f_{cr} = 1$ may be too costly to achieve,
$f_{cr}$ is maximized in the model. $f_{cr}$ is defined as follows:

$$\max f_{cr} = \frac{\sum_{i \in T} |C_i|}{|U|} \qquad (2)$$

3.2 Cost Statement 


Cost is a very important optimization objective: the investment in base station planning
accounts for about two-thirds of the total investment in the whole network. A lower base
station planning cost can therefore effectively reduce the total network investment and
enhance the competitiveness of the communication enterprise. The cost of base station
planning comes mainly from building the base stations. If each base station has a fixed
cost, the number of base stations becomes the most important optimization objective.

To avoid wasting base stations, we estimate the number of base stations by dividing the
terrain area by the coverage area of a base station. There are three different base station
radii, so the number of base stations lies in the interval [20, 30]. In our experiments,
the cost is optimal when the number is 25.


4 Model Solving Based on Genetic Algorithm 


The genetic algorithm is one of the four main branches of evolutionary computation and is
the evolutionary algorithm that has developed most rapidly in the last ten years. Genetic
algorithms, evolution strategies, evolutionary programming, and genetic programming have
developed rapidly and gradually merged, forming a new computational theory of simulated
evolution [5, 6].

A genetic algorithm simulates the process of biological evolution: the chromosomes, or
individuals, of a population undergo crossover and mutation operations [7]. Basic genetic
manipulation is implemented with three types of genetic operators: selection, crossover,
and mutation. The genetic algorithm is therefore also regarded as a random search
algorithm, but it is at the same time an iterative optimization process with self-adapting
characteristics.

To solve the mathematical model with the genetic algorithm, we use floating-point encoding.
$X_i = \{x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,3m-2}, x_{i,3m-1}, x_{i,3m}\}$ $(m = |TS|)$
is the encoding of the $i$-th chromosome of the GA, and
$(x_{i,3k-2}, x_{i,3k-1}, x_{i,3k}) \in X_i$ $(k = 1, \ldots, m)$ are the three-dimensional
coordinates of the $k$-th base station. $F = f_{cr}$ is the objective function of the entire
system, so $F(X_i)$ is the fitness value of the $i$-th chromosome, that is, a measure of the
quality of a deployment of $m$ base stations in the planning region. The closer the fitness
value is to 1, the closer the plan is to the optimum. The algorithm flow chart is shown in
Fig. 6.
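
The floating-point encoding and the fitness evaluation can be sketched in Python as
follows. The source does not specify the genetic operators, so the binary tournament
selection, arithmetic crossover, Gaussian mutation, and elitism used here are assumptions,
as are all function names, the default parameters, and the simplification that every base
station has the same coverage radius.

```python
import math
import random


def decode(chromosome):
    """Split X_i = (x_1, ..., x_3m) into m (x, y, z) base-station positions."""
    return [tuple(chromosome[3 * k:3 * k + 3]) for k in range(len(chromosome) // 3)]


def fitness(chromosome, demand_points, radius):
    """F(X_i) = f_cr: fraction of demand points whose nearest station is in range."""
    stations = decode(chromosome)
    covered = sum(
        1 for p in demand_points
        if min(math.dist(p, s) for s in stations) < radius
    )
    return covered / len(demand_points)


def run_ga(demand_points, m, bounds, radius, pop_size=30, generations=100,
           mutation_rate=0.1, seed=0):
    """Maximise coverage with a simple elitist GA; returns (best, history).

    bounds holds one (low, high) pair for each of the x, y and z coordinates.
    """
    rng = random.Random(seed)

    def random_chromosome():
        return [rng.uniform(lo, hi) for _ in range(m) for (lo, hi) in bounds]

    def score(c):
        return fitness(c, demand_points, radius)

    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=score)
    history = [score(best)]
    for _ in range(generations):
        if history[-1] >= 1.0:      # full coverage reached, stop early
            break
        children = [best[:]]        # elitism: carry the best individual over
        while len(children) < pop_size:
            # Selection: binary tournament on fitness.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            # Crossover: arithmetic blend of the two parents.
            w = rng.random()
            child = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
            # Mutation: Gaussian perturbation clipped to the terrain bounds.
            for g in range(len(child)):
                if rng.random() < mutation_rate:
                    lo, hi = bounds[g % 3]
                    sigma = 0.1 * (hi - lo)
                    child[g] = min(hi, max(lo, child[g] + rng.gauss(0, sigma)))
            children.append(child)
        population = children
        best = max(population, key=score)
        history.append(score(best))
    return best, history
```

Because the best individual is copied unchanged into each new generation, the recorded
best fitness never decreases from one generation to the next.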
Fig. 6. Genetic Algorithm flow chart


In the CAD system, we designed a dialog for running the genetic algorithm, shown in
Fig. 7. The maximum iterations input box sets the maximum number of iterations, and the
color setting input box sets the color of the base stations. First, we set the maximum
number of iterations to 100. The genetic algorithm did not reach the optimal result and
stopped at the 100th generation; the result, with base stations in green, is shown in
Fig. 8. Second, we set the maximum number of iterations to 1000. The genetic algorithm
reached the optimal result and stopped at the 523rd generation; the result, with base
stations in red, is also shown in Fig. 8. Each circle represents the signal coverage of a
base station.





Fig. 7. Genetic Algorithm dialog Fig. 8. The plan of base station location (Color 
figure online) 


The iterative process of the genetic algorithm is shown in Fig. 9, which plots the coverage
rate for a run with 25 base stations and 300 iterations.
Fig. 9. Iterative process of the Genetic Algorithm


5 Conclusions 


In this paper, we proposed a new method based on a genetic algorithm for base station
location planning. We also developed a CAD system based on Google Earth and ACIS that
provides data for the genetic algorithm and displays the base station location plan on the
reconstructed three-dimensional terrain.

Future work will continue to improve traditional intelligent optimization algorithms and
implement them in a three-dimensional CAD system. The mathematical model of base station
planning will be developed further, and the genetic algorithm will be improved.
Abstract. Cloud computing is becoming more and more popular and attracts
considerable attention. In the three-tier cloud environment, an important
problem is to determine the optimal multi-server system configuration so that
the profit of the service provider can be maximized. In related work, the
maximum allowed waiting time of service is assumed to be a constant, and the
rental price is also assumed to be constant for all servers despite the fact that
different servers have different execution speeds. These assumptions may not be
valid in realistic cloud environments. In this paper, we propose an optimization
model to determine the optimal configuration of the multi-server system. The
proposed model differs from existing work in two major ways. First, the
maximum allowed waiting time is not a constant and may change with different
service requests. Second, servers with different execution speeds may have
different rental prices. Experiments are carried out to verify the performance of
the proposed optimization model. The results show that the proposed
optimization model can help the service provider gain more profit than the
existing work.


Keywords: Multi-server system · Profit maximization ·
Maximum allowed waiting time · Response time · Rental price


1 Introduction 


Cloud computing is becoming more and more popular and attracts considerable attention [5].
A cloud computing environment has three tiers, i.e., infrastructure providers, service
providers, and consumers (see Fig. 1) [1, 2]. An infrastructure provider maintains the
basic hardware and software resources. A service provider rents a certain scale of software
and hardware resources from infrastructure providers and provides services to consumers.
Consumers submit their service requests to the service provider and pay based on the
quantity and the quality of the services. In this cloud computing environment, the problem
of optimal multi-server system configuration for profit maximization has been introduced as
a significant new research topic in computer science [3-5]. The configuration of a
multi-server system is characterized by two basic features, i.e., the size of the
multi-server system (the number of rented servers) and the execution speed of the
multi-server system (the execution speed of the rented servers) [3]. The key issue of the
multi-server system configuration problem is to determine the optimal size and execution
speed of the multi-server system such that the service provider gains the maximum profit.


Fig. 1. The three-tier cloud environment: consumers, service provider, and infrastructure
provider.


As in any other business, profit is the most important issue for a service provider. The
service provider's profit is determined by two parts, i.e., the gained revenue and the
corresponding cost; the profit is the revenue minus the cost. For a service provider, the
revenue is the sum of the charges to the consumers, and the cost is the rental cost paid to
the infrastructure providers plus the electricity cost caused by energy consumption. The
charge to a consumer depends on the quantity and the quality of the service: if the quality
of the service is guaranteed, the service is fully charged; otherwise, if the quality of
the service is lower than the promised Quality of Service (QoS), the service provider
serves the service request for free as a penalty for low service quality [3, 6, 7]. The
electricity cost is linearly proportional to the number of servers and to the square of the
server speed [8, 9].

Much work has been done on the problem of multi-server system configuration for profit
maximization [3-5]. Cao et al. [3] proposed a single renting scheme to configure the
multi-server system. Under this scheme, all servers in the multi-server system are
long-term rented servers, so the system lacks elasticity and easily leads to resource waste
and consumer loss. To overcome this weakness, Mei et al. [4] proposed a combined renting
scheme, in which the main computing capacity is provided by long-term rented servers and
the rest by short-term rented servers. In this service system, when a consumer submits a
service request, the request is first put into a queue, where it waits until it can be
served by a server. To satisfy the QoS requirement, however, the waiting time of each
service request in the queue must be limited to a certain range, which is called the
maximum allowed waiting time and is determined by the service-level agreement (SLA). To
guarantee the QoS, service of the request should start no later than the maximum allowed
waiting time. So, if the waiting time of a service request in the queue reaches the maximum
allowed waiting time, the system temporarily rents a short-term server to provide the
service, and the short-term server is freed when the service is finished. In Mei et al.'s
study, however, the maximum allowed waiting time is deemed to be the same for all service
requests. This assumption may not be realistic. Therefore, this paper takes into account
the condition that different service requests may have different maximum allowed waiting
times.

In Mei et al.'s work [4], servers with different execution speeds are deemed to have the
same rental price, so the rental cost in their work depends only on the number of rented
servers. More realistically, the rental price of a server may depend on its execution
speed. The rental cost then depends not only on the number of rented servers but also on
their execution speed: the more servers are rented, the higher the rental cost, and the
faster the rented servers, the higher the rental cost.

In this paper, the problem of optimal multi-server system configuration for profit
maximization is studied as an optimization model. The model takes into account that the
maximum allowed waiting times may be variables and that the rental price of a server may
change with its execution speed. The contributions of the paper are summarized as follows.

The maximum allowed waiting times of different service requests are no longer deemed to be
a constant but random variables.

The condition that servers with different execution speeds may have different rental
prices is taken into consideration, and a rental price model is proposed.

An optimization model is proposed to solve the problem of optimal multi-server
system configuration for profit maximization.

Experimental studies are conducted to verify the performance of the proposed
optimization model. The results show that it can yield more profit than existing work.

The rest of the paper is organized as follows. Section 2 presents the relevant models 
and rental scheme, including a three-tier cloud environment model, a multi-server 
system model, an energy consumption model and a combined renting scheme. In 
Sect. 3, an important probability used in this paper is calculated. Section 4 describes 
the rental price model. Section 5 establishes an optimization model to solve the 
problem of optimal multi-server system configuration for profit maximization. 
Section 6 verifies the performance of the proposed model through comparison with that 
of existing model. Finally, Sect. 7 concludes the work. 


2 The Cloud Environment, Models and Renting Scheme 


This section introduces three relevant models, namely a three-tier cloud environment 
model, a multi-server system model and an energy consumption model, and a com- 
bined renting scheme. 


2.1 Three-Tier Cloud Environment Model 


In this paper, we study the multi-server system configuration problem in a three-tier
cloud computing environment, in which there are three tiers, i.e., infrastructure
providers, service providers, and consumers [3, 4, 15], as shown in Fig. 1. The
infrastructure provider maintains the basic hardware and software resources. The service
provider rents a certain scale of software and hardware resources from infrastructure
providers and builds its service platform to provide services to consumers. Consumers
submit their service requests to the service provider and pay based on the quantity and the
quality of the services. The infrastructure provider offers two kinds of resource renting
schemes, i.e., long-term renting and short-term renting, and the rental price per server
per unit period of time is much cheaper with long-term renting than with short-term renting
[4]. For a service provider, the gained revenue is the sum of the charges to consumers, and
the cost involved is the rental cost paid to the infrastructure providers plus the
electricity cost caused by energy consumption. The profit is thus the gap between the
revenue and the cost [3, 5].


2.2 Multi-server System Model


The cloud service provider is considered to be a multi-server system with a service
request queue [17]. The multi-server system consists of $m$ identical long-term rented
servers and can be extended by temporarily renting short-term servers from the
infrastructure provider [4]. The multi-server system may therefore have two parts, i.e.,
long-term rented servers (the long-term part) and possible short-term rented servers (the
short-term part). Each server in the multi-server system has an execution speed of $s$
(unit: billion instructions per second) [4]. The long-term part of the multi-server system
can be modeled as an M/M/m queuing system [5, 18].

This paper adopts the following assumptions for the multi-server system. These assumptions
are also used in related studies [3, 4].


(i) Service requests arrive according to a Poisson process with arrival rate $\lambda$
(measured by the number of service requests arriving per second). This means that
the inter-arrival times are independent and identically distributed (i.i.d.)
exponential random variables (r.v.'s) with mean $1/\lambda$.

(ii) The multi-server system maintains a queue with infinite capacity.

(iii) Different service requests may have different service sizes (measured by the
number of billion instructions), denoted by $r_i$ $(i \in \{1, 2, 3, \ldots\})$, which are
i.i.d. exponential r.v.'s with parameter $\mu$.

(iv) Each service request can only be served by one server that may be long-term or
short-term rented.

(v) The first-come-first-served (FCFS) queuing discipline is adopted.


Denote by $N(t)$ the total number of service requests that arrive during the time
interval $(0, t]$, with mean $E[N(t)] = \lambda t$.

Denote by $N_s(t)$ and $N_l(t)$ the numbers of service requests served by the
short-term and long-term servers, respectively, during the time interval $(0, t]$.

An arriving service request is put into the queue and waits there until it can be handled
by an available server. The working process of the system can then be modeled as in
Fig. 2.
Fig. 2. The working process of the system.


According to assumption (iii), the mean of $r_i$ is $\bar{r} = E(r_i) = 1/\mu$ (unit:
billion instructions), and the execution times of the services on the multi-server system,
$x_i = r_i/s$ $(i \in \{1, 2, 3, \ldots\})$, are i.i.d. exponential r.v.'s with mean
$\bar{x} = E(x_i) = \bar{r}/s = 1/(s\mu)$ (unit: second).

The system service intensity is the average percentage of time that a server of the
multi-server system is busy. Denote by $\rho$ the system service intensity, which is given
by

$$\rho = \frac{\lambda \bar{x}}{m} = \frac{\lambda}{m s \mu}. \qquad (1)$$

According to [5, 10], $\rho$ should be no more than 1, i.e., $\lambda \le m s \mu$.

Denote by $W_i$ the waiting time of service request $i$ in the queue. The cumulative
distribution function (c.d.f.) of $W_i$, $F_{W_i}(t)$, can be derived from M/M/m queuing
system theory [11]:

$$F_{W_i}(t) = 1 - \frac{\pi_m}{1 - \rho}\, e^{-m s \mu (1 - \rho) t}, \qquad (2)$$

where

$$\pi_m = \frac{(m\rho)^m}{m!} \left[ \sum_{k=0}^{m-1} \frac{(m\rho)^k}{k!} + \frac{(m\rho)^m}{m!\,(1 - \rho)} \right]^{-1}.$$

2.3 Energy Consumption Model 


The cost of a service provider consists of two major parts, i.e., the rental cost paid to the 
infrastructure provider and the electricity cost caused by energy consumption. The cost 
of energy consumption is determined by the electricity price and the amount of energy 
consumption. 

Denote by y the price of unit energy (unit: cents per Watt). The power consumption
of a modern processor can be divided into two parts, dynamic power $P_d$ (unit: Watt) and
static power $P^*$ (unit: Watt) [12]. The dynamic power model used in this paper is given
by Eq. (3), which is also used in [3-5]:

$$P_d = \xi s^{\alpha}. \qquad (3)$$

When $\xi = 9.4192$ and $\alpha = 2.0$, the power consumption is close to that of
the Intel Pentium M processor [13]. Therefore, the power per unit period of time for a
busy server is

$$P = P_d + P^*. \qquad (4)$$
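
A small Python sketch of the power model follows. The function name is hypothetical, and
the static power $P^*$ is passed in explicitly because its value is not given in this
section.

```python
def busy_server_power(s, p_static, xi=9.4192, alpha=2.0):
    """Power (Watt) of a busy server of speed s: P = xi * s**alpha + P*."""
    if s <= 0:
        raise ValueError("execution speed must be positive")
    return xi * s ** alpha + p_static


if __name__ == "__main__":
    # Doubling the speed quadruples the dynamic part when alpha = 2.
    for speed in (0.5, 1.0, 2.0):
        print(speed, busy_server_power(speed, p_static=0.0))
```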


2.4 Combined Renting Scheme 


The renting scheme combines long-term renting with short-term renting for the
multi-server system [4]. The main service capacity of the multi-server system is provided
by long-term rented servers, owing to their low rental price, and the remaining service
capacity is provided by short-term rented servers. Algorithm 1 shows the combined renting
scheme.








Algorithm 1. Combined renting scheme
A multi-server system with m servers and speed s is running and waiting for the
following events
A queue Q is initialized as empty
Event - A service request arrives
    Put it at the end of queue Q, record its maximum allowed waiting time, and
    start a timer counting its waiting time
End Event
Event - A server becomes idle
    Check whether the queue Q is empty
    if true then
        Wait for a new service request
    else
        Take the first service request from queue Q and assign it to the idle server
    end if
End Event
Event - The maximum allowed waiting time of a request is reached
    Take this request from queue Q and rent a temporary server to execute the request
End Event
Event - A service on the temporary server is completed
    Release the temporary server
End Event
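
The event handling in Algorithm 1 can be replayed with a short Python sketch. It is a
simplified, sequential rendering rather than an event-driven implementation: requests are
processed in arrival order, which is enough to decide, under FCFS, whether each request
reaches a long-term server before its deadline. The function name, the tuple layout, and
the returned labels are assumptions, and the renting delay is ignored.

```python
import heapq


def combined_renting(requests, m, s):
    """Replay Algorithm 1 for requests given in arrival order.

    requests: list of (arrival_time, service_size, max_wait) tuples.
    Returns a list of (server_kind, start_time) pairs, where server_kind is
    "long" for a long-term server and "short" for a temporary one.
    """
    free_at = [0.0] * m     # times at which the long-term servers become idle
    heapq.heapify(free_at)
    schedule = []
    for arrival, size, max_wait in requests:
        # FCFS: the request at the head of the queue gets the next idle server.
        earliest = max(arrival, free_at[0])
        wait = earliest - arrival
        if wait == 0 or wait < max_wait:
            # A long-term server becomes idle before the deadline.
            heapq.heapreplace(free_at, earliest + size / s)
            schedule.append(("long", earliest))
        else:
            # Deadline reached first: rent a short-term server, which is
            # released again when this request completes.
            schedule.append(("short", arrival + max_wait))
    return schedule


if __name__ == "__main__":
    demo = [(0.0, 2.0, 1.0), (0.5, 1.0, 1.0), (1.0, 1.0, 5.0)]
    for kind, start in combined_renting(demo, m=1, s=1.0):
        print(kind, start)
```

Requests that move to a short-term server do not occupy a long-term server, so they leave
the heap of idle times unchanged and shorten the wait of the requests behind them.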








Denote by $W_i^m$ (unit: second) the maximum allowed waiting time of service request
$i$ in the queue before it is served. When a consumer submits a service request $i$ to the
system, the request is put into the queue, and the system records $W_i^m$ and starts
counting its waiting time $W_i$. The requests are assigned to and executed on the long-term
rented servers according to the FCFS discipline. Once $W_i$ of service request $i$ reaches
$W_i^m$, a temporary short-term server is rented from the infrastructure provider to
process the request [4]. The time spent on the renting activity is very short, so it is
ignored in this paper.
As the long-term part is modeled as an M/M/m queuing system, the multi-server system
built on the combined renting scheme can be modeled as an M/M/m queuing system with
impatient consumers [4]. Denote by $p_e(W_i \ge W_i^m)$ the steady-state probability that
$W_i$ reaches $W_i^m$ for service request $i$. Then service request $i$ is served on a
temporary short-term rented server with probability $p_e(W_i \ge W_i^m)$ and on the
long-term part with probability $1 - p_e(W_i \ge W_i^m)$. $p_e(W_i \ge W_i^m)$ is
calculated in the next section.


3 $p_e(W_i \ge W_i^m)$ Under Variable Maximum Allowed Waiting Time


In related work [4, 5, 7], $W_i^m$ is assumed to be the same constant for all $i$ in the
queue. This assumption may not be realistic: for some service request $i$, the maximum
allowed response time may be shorter than $W_i^m$, which leads to lower QoS. We therefore
consider the case in which $W_i^m$ changes with the service request.

In this paper, the QoS of service request $i$ is reflected by the response time $T_i$
(unit: second). Denote by $T_i^m$ (unit: second) the maximum allowed response time that
service request $i$ can tolerate before it is completed. $T_i^m$ is assumed to be linearly
proportional to the service size $r_i$, namely $T_i^m \propto r_i$, i.e.,
$T_i^m = d \cdot r_i$, where $d$ is a positive constant determined by the service-level
agreement (SLA). The response time is the waiting time plus the execution time, i.e.,
$T_i = W_i + x_i$. Then

$$W_i^m = T_i^m - x_i = k r_i, \qquad (5)$$

where

$$k = d - \frac{1}{s}. \qquad (6)$$
According to assumption (iii), the service sizes $r_i$ are i.i.d. exponential r.v.'s with
parameter $\mu$, so $W_i^m$ $(i \in \{1, 2, 3, \ldots\})$ are i.i.d. exponential r.v.'s
with parameter $\mu/k$. The probability density function (p.d.f.) of $W_i^m$ is

$$f_{W_i^m}(t) = \frac{\mu}{k}\, e^{-\frac{\mu}{k} t}, \qquad (7)$$

and the mean of $W_i^m$ is $k/\mu$, where $k$ is given by Eq. (6).
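
The step from the exponential service size to the exponential maximum allowed waiting time
can be spelled out as follows. The derivation assumes $k > 0$, i.e., $d > 1/s$, a condition
that is implied by the model but not stated explicitly.

```latex
\begin{align*}
W_i^m &= T_i^m - x_i = d\,r_i - \frac{r_i}{s}
       = \Bigl(d - \frac{1}{s}\Bigr) r_i = k\,r_i, \\
\Pr\bigl(W_i^m > t\bigr) &= \Pr\Bigl(r_i > \frac{t}{k}\Bigr) = e^{-\mu t / k},
       \qquad t \ge 0, \\
f_{W_i^m}(t) &= -\frac{\mathrm{d}}{\mathrm{d}t}\,\Pr\bigl(W_i^m > t\bigr)
       = \frac{\mu}{k}\, e^{-\mu t / k},
       \qquad E\bigl[W_i^m\bigr] = \frac{k}{\mu}.
\end{align*}
```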
According to [10], the probability of the event $W_i < W_i^m$ is:

$$\Pr(W_i < W_i^m) = \int_0^{\infty} F_{W_i}(t)\, f_{W_i^m}(t)\, dt. \qquad (8)$$


222 Z. Kang and B. Yang 


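Substituting (2) and (7) into (8), after some manipulation, we have

$$\Pr(W_i < W_i^m) = 1 - \frac{\pi_m}{(1 - \rho)\,[k m s (1 - \rho) + 1]}. \qquad (9)$$

Therefore, the probability of the complementary event of $W_i < W_i^m$,
$\Pr(W_i \ge W_i^m)$, is:

$$\Pr(W_i \ge W_i^m) = 1 - \Pr(W_i < W_i^m) = \frac{\pi_m}{(1 - \rho)\,[k m s (1 - \rho) + 1]}. \qquad (10)$$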


In $\Pr(W_i \ge W_i^m)$, all service requests wait in the queue even after exceeding their
maximum allowed waiting times. In the multi-server system with the combined renting
scheme, however, a service request whose waiting time reaches its maximum allowed waiting
time is removed from the queue and assigned to a temporary short-term server, which reduces
the waiting time of the following requests and hence the probability that their waiting
times reach their maximum allowed waiting times. According to [11], $p_e(W_i \ge W_i^m)$
is:

$$p_e(W_i \ge W_i^m) = \frac{(1 - \rho)\, \Pr(W_i \ge W_i^m)}{1 - \rho\, \Pr(W_i \ge W_i^m)}, \qquad (11)$$

where $\rho$ is given by (1) and $\Pr(W_i \ge W_i^m)$ is given by (10).

It can be seen from (5), (6), (10) and (11) that $p_e(W_i \ge W_i^m)$ is affected by $d$.
Figure 3 illustrates $p_e(W_i \ge W_i^m)$ for different $d$, where $\lambda = 5.99$,
$\mu = 1$, $m = 6$ and $s = 1$ [4].
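
The dependence of this probability on $d$ can be explored numerically with the Python
sketch below. It assumes the closed forms $\Pr(W_i \ge W_i^m) = \pi_m / \{(1 - \rho)[k m s (1 - \rho) + 1]\}$
and $p_e = (1 - \rho)\Pr / (1 - \rho \Pr)$ with $k = d - 1/s$ and $\rho = \lambda/(m s \mu)$;
the function names are hypothetical.

```python
import math


def pi_m(m, rho):
    """Steady-state probability that exactly m requests are in the system."""
    a = m * rho
    head = sum(a ** k / math.factorial(k) for k in range(m))
    tail = a ** m / (math.factorial(m) * (1.0 - rho))
    return (a ** m / math.factorial(m)) / (head + tail)


def prob_wait_exceeds(lam, mu, m, s, d):
    """Pr(W_i >= W_i^m) when no request ever leaves the queue."""
    rho = lam / (m * s * mu)
    k = d - 1.0 / s
    if not 0.0 < rho < 1.0 or k <= 0.0:
        raise ValueError("need 0 < rho < 1 and d > 1/s")
    return pi_m(m, rho) / ((1.0 - rho) * (k * m * s * (1.0 - rho) + 1.0))


def prob_short_term(lam, mu, m, s, d):
    """p_e(W_i >= W_i^m): probability that a request goes to a short-term server."""
    rho = lam / (m * s * mu)
    q = prob_wait_exceeds(lam, mu, m, s, d)
    return (1.0 - rho) * q / (1.0 - rho * q)


if __name__ == "__main__":
    # Parameter values quoted in the text for Fig. 3.
    for d in (2.0, 5.0, 10.0, 20.0):
        print(d, prob_short_term(lam=5.99, mu=1.0, m=6, s=1.0, d=d))
```

A larger $d$ gives a larger $k$ and therefore a longer mean allowed waiting time, so the
probability of renting a short-term server decreases as $d$ grows.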


Fig. 3. The probability of $W_i$ exceeding $W_i^m$.
4 Rental Price Model 


In this section, a new rental price model is described. In related work [4], Mei et al.
assumed that servers with different execution speeds have the same rental price. Hence,
they assume all long-term servers have the same rental price $\beta$ (unit: cents per second) for


A Study of Optimal Multi-server System Configuration 223 


each server, and all short-term servers have the same rental price $\gamma$ (unit: cents per
second) for each server as well, where $\beta < \gamma$. This assumption may not be realistic.

In practice, the rental price of a server normally changes with its execution speed, i.e.,
a server with a higher execution speed normally has a higher rental price. This paper takes
this into account. Denote by $s_0$ (unit: billion instructions per second) the baseline
execution speed [3]. The rental price of a long-term rented server with execution speed
$s_0$ is $\beta_0$ (unit: cents per second), and the rental price of a short-term rented
server with execution speed $s_0$ is $\gamma_0$ (unit: cents per second), where
$\beta_0 < \gamma_0$. The infrastructure provider maintains several kinds of servers with
different execution speeds $s$. $s$ can be either higher or lower than $s_0$, but
$s \in S$, where $S$ is the set of possible execution speeds. The service provider can rent
any kind of server whose speed is in $S$.

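Assume that the relations between the execution speed and the rental price of a long-term
rented server and of a short-term rented server are, respectively, as follows:

$$\frac{\beta}{\beta_0} = \left(\frac{s}{s_0}\right)^{\tau}, \qquad (12)$$

$$\frac{\gamma}{\gamma_0} = \left(\frac{s}{s_0}\right)^{\tau}. \qquad (13)$$

Hence, the rental price of a long-term or short-term rented server with execution speed
$s$ can be obtained from (12) and (13), respectively, as follows:

$$\beta = \beta_0\left(\frac{s}{s_0}\right)^{\tau} = \frac{\beta_0}{s_0^{\tau}}\,s^{\tau}, \qquad (14)$$

$$\gamma = \gamma_0\left(\frac{s}{s_0}\right)^{\tau} = \frac{\gamma_0}{s_0^{\tau}}\,s^{\tau}. \qquad (15)$$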


Figure 4 shows the rental prices $\beta$ and $\gamma$ for different execution speeds, where
$s_0 = 1.0$, $\beta_0 = 1.5$ and $\gamma_0 = 3.0$. In Fig. 4, $\tau = 0.5$ and $\tau = 1$ for (a) and (b),
respectively.





























Fig. 4. Rental prices $\beta$ and $\gamma$ with different execution speeds $s$: (a) $\tau = 0.5$; (b) $\tau = 1$.
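
The power-law price relation of this section, price = base price x (s / s_0)^tau, can be tabulated with a short sketch. The defaults below reuse the values quoted for Fig. 4 (s_0 = 1.0, beta_0 = 1.5, gamma_0 = 3.0); the function name `rental_price` is an illustrative choice, not something defined in the paper.

```python
def rental_price(base_price: float, s: float, s0: float = 1.0, tau: float = 1.0) -> float:
    """Rental price of a server with speed s, scaled from the baseline speed s0."""
    if s <= 0 or s0 <= 0:
        raise ValueError("execution speeds must be positive")
    return base_price * (s / s0) ** tau


if __name__ == "__main__":
    speeds = [round(0.2 * k, 1) for k in range(1, 11)]  # S = {0.2, 0.4, ..., 2}
    for tau in (0.5, 1.0):
        print(f"tau = {tau}")
        for s in speeds:
            beta = rental_price(1.5, s, tau=tau)   # long-term price
            gamma = rental_price(3.0, s, tau=tau)  # short-term price
            print(f"  s = {s:.1f}  beta = {beta:.4f}  gamma = {gamma:.4f}")
```

At s = s_0 both prices equal their baseline values for any tau, which is a convenient sanity check on the implementation.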
5 Optimal Multi-server System Configuration


To determine the optimal configuration (m, s), i.e., the optimal number m and execution
speed s of the servers in the multi-server system such that the service provider gains
the maximum profit, an optimization model is established in this section.

Denote by $pf(t)$ (unit: cents) and $R(t)$ (unit: cents) the profit and the revenue of the
multi-server system during the time interval $(0, t]$, respectively. Denote by $C_l(t)$ and
$C_s(t)$ the costs of the multi-server system on the long-term servers and the short-term
servers, respectively, during the time interval $(0, t]$, including the rental cost and the
electricity cost. Then the profit of the multi-server system during the time interval
$(0, t]$ can be calculated as:

$$pf(t) = R(t) - C_s(t) - C_l(t). \qquad (16)$$

The mean of $pf(t)$ is

$$E[pf(t)] = E[R(t)] - E[C_s(t)] - E[C_l(t)]. \qquad (17)$$


According to [10], the following proposition can be obtained.


Proposition 1. $\{N_s(t)\}$ and $\{N_l(t)\}$ are both Poisson processes, with respective rates
$\lambda p_e(W_i < W_i^M)$ and $\lambda[1 - p_e(W_i < W_i^M)]$. Furthermore, the two processes are
independent.

According to Proposition 1, the mean of $N_s(t)$ is

$$E[N_s(t)] = \lambda\,p_e(W_i < W_i^M)\cdot t, \qquad (18)$$

and the mean of $N_l(t)$ is

$$E[N_l(t)] = \lambda\,[1 - p_e(W_i < W_i^M)]\cdot t. \qquad (19)$$


As the combined renting scheme can guarantee the QoS of all service requests, in
steady state the revenue during the time interval $(0, t]$ can be represented as

$$R(t) = \sum_{i=1}^{N(t)} a\,r_i, \qquad (20)$$

where a is a positive constant, which indicates the price per billion instructions (unit:
cents per billion instructions).
According to [10], $R(t)$ is a compound Poisson process, with the mean

$$E[R(t)] = a\lambda t\cdot E(r_i) = \frac{a\lambda t}{\mu}. \qquad (21)$$
The cost on the short-term rented servers in steady state, $C_s(t)$, is calculated as:

$$C_s(t) = \sum_{i=1}^{N_s(t)} (\gamma + \psi P)\,\frac{r_i}{s}, \qquad (22)$$

where $P$ is given by (4), $\gamma$ is given by (15), and $\psi$ is introduced in Sect. 2.3.
According to (18) and (21), we have

$$E[C_s(t)] = \frac{\gamma + \psi P}{s}\,E[N_s(t)]\,E(r_i) = \frac{\gamma + \psi P}{s\mu}\,\lambda\,p_e(W_i < W_i^M)\,t. \qquad (23)$$


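The cost on the long-term rented servers in steady state, $C_l(t)$, is calculated as:

$$C_l(t) = m\beta t + \psi P_l(t), \qquad (24)$$

where

$$P_l(t) = m\cdot\left(\frac{\sum_{i=1}^{N_l(t)} r_i}{m s}\,P_d + P^{*}t\right), \qquad (25)$$

and $\beta$ is given by (14). In Eq. (25), $P_d$ is given by (3), and $P^{*}$ is introduced in Sect. 2.3.
According to (19), we have

$$E\left[\sum_{i=1}^{N_l(t)} r_i\right] = [1 - p_e(W_i < W_i^M)]\,\lambda t\cdot E(r_i) = \frac{[1 - p_e(W_i < W_i^M)]\,\lambda t}{\mu}, \qquad (26)$$

then

$$E[P_l(t)] = m\cdot\left\{\frac{[1 - p_e(W_i < W_i^M)]\,\lambda}{m s \mu}\,P_d + P^{*}\right\}t = m\cdot\left\{[1 - p_e(W_i < W_i^M)]\,\rho P_d + P^{*}\right\}t, \qquad (27)$$

furthermore,

$$E[C_l(t)] = m\beta t + \psi\cdot E[P_l(t)] = m\cdot\left\{\beta + \psi\left[(1 - p_e(W_i < W_i^M))\,\rho P_d + P^{*}\right]\right\}t. \qquad (28)$$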
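Therefore, substituting (3), (21), (23) and (28) into (17), we have

$$E[pf(t)] = \frac{a\lambda t}{\mu} - \frac{\gamma + \psi P}{s\mu}\,\lambda\,p_e(W_i < W_i^M)\,t - m\cdot\left\{\beta + \psi\left[(1 - p_e(W_i < W_i^M))\,\rho\xi s^{\alpha} + P^{*}\right]\right\}t. \qquad (29)$$

Then

$$\frac{E[pf(t)]}{t} = \frac{a\lambda}{\mu} - \frac{\gamma + \psi P}{s\mu}\,\lambda\,p_e(W_i < W_i^M) - m\cdot\left\{\beta + \psi\left[(1 - p_e(W_i < W_i^M))\,\rho\xi s^{\alpha} + P^{*}\right]\right\}. \qquad (30)$$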


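The key issue of the multi-server system configuration problem is to determine the
optimal values of m and s of the servers so that the service provider can get the
maximum profit. With $\beta$ and $\gamma$ given by (14) and (15), the profit can be calculated as follows:

$$pf(m, s) = \frac{a\lambda}{\mu} - \left(\frac{\gamma_0}{s_0^{\tau}}\,s^{\tau-1} + \frac{\psi P}{s}\right)\frac{\lambda}{\mu}\,p_e(W_i < W_i^M) - m\cdot\left\{\frac{\beta_0}{s_0^{\tau}}\,s^{\tau} + \psi\left[(1 - p_e(W_i < W_i^M))\,\rho\xi s^{\alpha} + P^{*}\right]\right\}. \qquad (31)$$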


The problem of optimal multi-server system configuration for profit maximization
can be established as the following optimization model:

$$\text{Maximize } pf(m, s) \qquad (32)$$

Subject to

$$\rho = \frac{\lambda}{m s \mu} < 1, \qquad (33)$$

$$m \in \{1, 2, 3, \ldots\}, \qquad (34)$$

$$s \in S. \qquad (35)$$

Equations (33)-(35) are the constraints of the established optimization model.
Equation (33) indicates that the system service intensity is less than 1. Equation (34)
denotes that the number of servers should be a positive integer. The available execution
speed levels of the servers are limited to the set S, which is indicated by Eq. (35). In the
following section, some experiments are conducted to test the performance of the
proposed optimization model.
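
Because m is an integer and S is a small finite set, the model (32)-(35) can also be explored by plain enumeration. The sketch below is an illustration under stated assumptions, not the authors' solver (the paper uses a genetic algorithm): the probability p_e is supplied by the caller through a hypothetical callback `p_e_of(m, s)`, since it depends on equations outside this section; the dynamic power is assumed to have the form xi * s**alpha and the busy power P the form xi * s**alpha + P*; and the service intensity is taken as lambda / (m * s * mu). The default parameter values are those listed in Sect. 6.1.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def profit_rate(m: int, s: float, p_e: float, *, lam: float = 5.99, mu: float = 1.0,
                a: float = 15.0, psi: float = 0.3, xi: float = 9.41292,
                alpha: float = 2.0, p_star: float = 3.0, beta0: float = 1.5,
                gamma0: float = 3.0, s0: float = 1.0, tau: float = 1.0) -> Optional[float]:
    """Expected profit per unit time; returns None when the intensity constraint fails."""
    rho = lam / (m * s * mu)            # assumed form of the service intensity
    if rho >= 1.0:
        return None
    beta = beta0 * (s / s0) ** tau      # long-term rental price
    gamma = gamma0 * (s / s0) ** tau    # short-term rental price
    p_dyn = xi * s ** alpha             # assumed form of the dynamic power
    p_busy = p_dyn + p_star             # assumed form of the busy power P
    revenue = a * lam / mu
    short_cost = (gamma + psi * p_busy) / (s * mu) * lam * p_e
    long_cost = m * (beta + psi * ((1.0 - p_e) * rho * p_dyn + p_star))
    return revenue - short_cost - long_cost


def best_configuration(p_e_of: Callable[[int, float], float], speeds: Iterable[float],
                       max_m: int = 50, **params) -> Optional[Tuple[float, int, float]]:
    """Enumerate every (m, s) pair and keep the feasible one with the largest profit."""
    best = None
    for m, s in product(range(1, max_m + 1), list(speeds)):
        value = profit_rate(m, s, p_e_of(m, s), **params)
        if value is not None and (best is None or value > best[0]):
            best = (value, m, s)
    return best


if __name__ == "__main__":
    speeds = [round(0.2 * k, 1) for k in range(1, 11)]
    # Placeholder callback: a real study would evaluate p_e from the queueing model.
    print(best_configuration(lambda m, s: 0.0, speeds))
```

Enumeration is exact on a finite grid and cheap here (at most `max_m` times |S| evaluations), which makes it a useful cross-check for a heuristic solver.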


6 Experimental Study 


The proposed optimization model can be solved by any optimization algorithm, such
as a genetic algorithm, an ant colony algorithm, or simulated annealing. In
this paper, we use a genetic algorithm to solve the optimization model and compare the
results with those obtained from the optimization model in [4].

In [4], $W_i^M$, $\beta$ and $\gamma$ are all deemed to be constant. To make the proposed
optimization model more practical than that of [4], the conditions that the maximum
allowed waiting times of services are random variables and that the rental price changes
with the execution speed are taken into account. To distinguish the proposed optimization
model from the compared model, the proposed model is called the new model
and the compared model is called the old model in this paper.

In order to use an optimization algorithm to solve the new model and the old model, it is
necessary to obtain a closed-form expression of $\pi_m$. In this paper, we use the same
closed-form approximation as [3, 4], which is $\sum_{k=0}^{m-1} \frac{(m\rho)^k}{k!} \approx e^{m\rho}$. This expression is accurate
when m is not too small and s is not too large [14]. According to Stirling's approximation
of $m!$, i.e., $m! \approx \sqrt{2\pi m}\,(m/e)^m$ [16], one closed-form expression of $\pi_m$ is given by

$$\pi_m \approx \frac{1-\rho}{\sqrt{2\pi m}\,(1-\rho)\left(\frac{e^{\rho}}{e\rho}\right)^{m} + 1}.$$

In the following, we solve the new model and the old model based on the above
closed-form expression of $\pi_m$.
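
The quality of this approximation can be checked numerically. The sketch below assumes that $\pi_m$ denotes the steady-state probability of exactly m requests in an M/M/m queue, which is the quantity for which the two approximations above (the truncated exponential sum and Stirling's formula) yield the stated closed form; both function names are illustrative.

```python
import math


def pi_m_exact(m: int, rho: float) -> float:
    """Exact M/M/m probability of m requests in the system (assumed meaning of pi_m)."""
    load = m * rho
    head = sum(load ** k / math.factorial(k) for k in range(m))
    tail = load ** m / (math.factorial(m) * (1.0 - rho))
    return (load ** m / math.factorial(m)) / (head + tail)


def pi_m_closed_form(m: int, rho: float) -> float:
    """Closed form obtained with sum_{k<m} (m*rho)^k / k! ~ e^(m*rho) and Stirling's formula."""
    growth = (math.exp(rho) / (math.e * rho)) ** m
    return (1.0 - rho) / (math.sqrt(2.0 * math.pi * m) * (1.0 - rho) * growth + 1.0)


if __name__ == "__main__":
    for m, rho in ((6, 0.5), (10, 0.6), (20, 0.8)):
        print(m, rho, pi_m_exact(m, rho), pi_m_closed_form(m, rho))
```

Comparing the two columns for the (m, rho) pairs of interest gives a quick indication of how much error the closed form introduces into the optimization.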


6.1 Parameters Setting 


In the compared old model, $\lambda = 5.99$, $a = 15$, $P^{*} = 3$, $c = 0.3$, $\alpha = 2.0$, $\xi = 9.41292$,
$\bar{r} = 1$, $D = 5$, $\beta = 1.5$, $\gamma = 3$ and $S = \{0.2, 0.4, \ldots, 2\}$ [4].

In the proposed new model, $\lambda$, $a$, $P^{*}$, $\xi$, $\alpha$ and $S$ are set to the same values as in the
old model. $\psi = 0.3$, $\beta_0 = 1.5$, $\gamma_0 = 3$, $s_0 = 1$, $\mu = 1$, so that the mean of service size r in
our proposed new model equals the mean service size $\bar{r}$ in Mei et al.'s work,
namely E(r) =F = 1, and set d =+ +5, so that k = 5 and E(W)") = D = 5. 

Denote by (m*, s*) the optimal configuration obtained from the old model, and by
(m, s) and pf the optimal configuration and maximum profit obtained from the new
model. The profit that configuration (m*, s*) yields in the new model is denoted by pf'.


6.2 Performance Comparison 


Table 1 shows the optimal configuration and maximum profit obtained from the old model.
As Table 1 shows, when the multi-server system rents 6 long-term servers with an
execution speed of 1 billion instructions per second, the service provider gains the
maximum profit of 57.9124 cents per unit time.


Table 1. The optimal configuration and maximum profit obtained from old model.

m | s   | pf
6 | 1.0 | 57.9124


The QoS of services is guaranteed by both models, so the revenue is the same
constant in both models.

In the first experiment, we explore the difference between configurations (m, s) and
(m*, s*), and the difference between profits pf and pf', for different $s_0$ when $\tau = 1$.
The results are shown in Table 2, and Fig. 5a shows the trends of pf and pf' with
different $s_0$ when $\tau = 1$. Both pf and pf' increase as $s_0$ increases from 0.2 to 2. That is
because, as $s_0$ increases, the rental prices $\beta$ and $\gamma$ both decrease. Hence the costs on
long-term servers and short-term servers both decrease when the number and
execution speed of the servers remain unchanged. In addition, the profit pf obtained from
the new model proposed in this paper is always greater than pf', which indicates that
the new model gains more profit than the old model through a more reasonable configuration.
As can be seen from Table 2, the optimal configuration obtained from the new model is
10 servers with execution speed 0.6, while that obtained from the old model is 6
servers with execution speed 1.0.


Table 2. The difference between configurations (m, s) and (m*, s*), and the difference between
profits pf and pf', with different $s_0$ when $\tau = 1$.

s_0 | (m, s)   | pf      | (m*, s*) | pf'
0.2 | (10,0.6) | 23.1808 | (6,1.0)  | 19.9803
0.4 | (10,0.6) | 46.8233 | (6,1.0)  | 43.6800
0.6 | (10,0.6) | 54.7041 | (6,1.0)  | 51.5799
0.8 | (10,0.6) | 58.6445 | (6,1.0)  | 55.5299
1.0 | (10,0.6) | 61.0087 | (6,1.0)  | 57.8999
1.2 | (10,0.6) | 62.5849 | (6,1.0)  | 59.4798
1.4 | (10,0.6) | 63.7107 | (6,1.0)  | 60.6084
1.6 | (10,0.6) | 64.5551 | (6,1.0)  | 61.4548
1.8 | (10,0.6) | 65.2118 | (6,1.0)  | 62.1131
2.0 | (10,0.6) | 65.7372 | (6,1.0)  | 62.6398


Table 3. The difference between configurations (m, s) and (m*, s*), and the difference between
profits pf and pf', with different $s_0$ when $\tau = 1.5$.

s_0 | (m, s)   | pf      | (m*, s*) | pf'
0.2 | (30,0.2) | 11.8929 | (6,1.0)  | -38.608
0.4 | (15,0.4) | 45.6617 | (6,1.0)  | 29.9072
0.6 | (15,0.4) | 56.4112 | (6,1.0)  | 46.9823
0.8 | (15,0.4) | 60.9118 | (6,1.0)  | 54.1312
1.0 | (15,0.4) | 63.2843 | (6,1.0)  | 57.8999
1.2 | (10,0.6) | 64.8931 | (6,1.0)  | 60.1681
1.4 | (10,0.6) | 66.0435 | (6,1.0)  | 61.6569
1.6 | (10,0.6) | 66.8462 | (6,1.0)  | 62.6957
1.8 | (10,0.6) | 67.4324 | (6,1.0)  | 63.4542
2.0 | (10,0.6) | 67.8758 | (6,1.0)  | 64.0281


Fig. 5. The pf and pf' with different $s_0$


In the second experiment, the configurations (m, s) and (m*, s*) and the profits pf and
pf' are compared for different $s_0$ when $\tau = 1.5$, and the results are shown in Table 3.
Figure 5b shows the trends of pf and pf' with different $s_0$ when $\tau = 1.5$. It can
be seen that the trends of the two curves are similar to those in Fig. 5a. When $s_0$ is 0.2,
the value of pf' is negative, which indicates that the optimal configuration obtained
from the old model is not reliable when $\tau = 1.5$ and $s_0 = 0.2$.

In the third experiment, we explore the difference between configurations (m, s) and
(m*, s*), and the difference between profits pf and pf', for different $\tau$ when $s_0 = 0.6$.
The results are shown in Table 4. The configuration (m, s) changes
from (6, 1) to (15, 0.4) when $\tau$ changes from 0.5 to 1.5, and then remains unchanged
as $\tau$ increases further. Figure 6a shows the trends of profits pf and pf' with
different $\tau$ when $s_0 = 0.6$. The two curves show different trends, i.e., pf increases
but pf' decreases. That is because $\beta$ and $\gamma$ respond differently to an increase of $\tau$
depending on whether s is greater or less than 0.6: $\beta$ and $\gamma$ decrease when s is less
than 0.6, but increase when s is greater than 0.6. Consequently, pf and pf' show
different trends.


Table 4. The difference between configurations (m, s) and (m*, s*), and the difference between
profits pf and pf', with different $\tau$ when $s_0 = 0.6$.








(6,1.0) 
33.3840 








Table 5 shows the difference between configurations (m, s) and (m*, s*),
and the difference between profits pf and pf', for different $\tau$ when $s_0 = 1.2$.
Figure 6b shows the trends of profits pf and pf' with different $\tau$ when $s_0 = 1.2$.
From the figure, we can see that both pf and pf' increase as $\tau$
increases from 0.5 to 2.5. That is because, as $\tau$ increases, $\beta$ and $\gamma$ both
decrease when s is not more than 1.0. As shown in Table 5, s and s* are not more than
1, so $\beta$ and $\gamma$ both decrease with the increase of $\tau$ in both configurations. Although
the rental cost tends to increase with the number of servers, the decrease caused by
the change in the execution speed of the servers is larger, so the rental
cost decreases overall. As a consequence, the profit of the multi-server
system increases. In addition, the profit obtained from our proposed new model is greater
than the profit of the old model, and the gap widens as $\tau$ increases from 0.5 to 2.5.
As can be seen from Table 5, the optimal configuration
obtained from the new model is 10 servers with execution speed 0.6
when $\tau$ is 0.5 and 1, and 15 servers with execution speed 0.4 when $\tau$ is from 1.5 to 2.5, while
the counterpart obtained from the old model is 6 servers with execution speed 1.
In essence, the difference in configuration leads to the gap in profit.
Fig. 6. The pf and pf' with different $\tau$.


Table 5. The difference between configurations (m, s) and (m*, s*), and the difference between
profits pf and pf', with different $\tau$ when $s_0 = 1.2$.














T 

(m, s) 

pf 

(m*, s*) | (6,1.0) 
pf’ 


7 Conclusions 


The problem of optimal multi-server system configuration for profit maximization in a
three-tier cloud environment is investigated in this paper. We propose an optimization
model to determine the optimal configuration such that the service provider gains
the maximum profit. There are two major differences from existing work. First, the
maximum allowed waiting time is treated as a random variable, which may change with
different service requests. Second, the condition that the rental price may differ for
different execution speeds is taken into consideration. In addition, the performance
of the proposed optimization model is verified by experiments. The
experimental results show that the optimization model proposed in this paper can
help the service provider gain more profit than existing work.
Abstract. Based on a time-series detection algorithm, this paper puts forward a
new analysis method for identifying Network Element (NE) hitches. Aiming at the
specific characteristics of the NE, this paper proposes a model that considers
seasonal timing characteristics and the impact of recent data on current data.
Considering the multi-dimensional characteristics of the NE, a density-based
discovery algorithm is introduced into the modeling process. Experiments on
actual data from operators demonstrate the effectiveness and accuracy of
the proposed methods.


Keywords: Big data · Abnormity detection · Network element management


1 Introduction 


In the field of network management, fault warning is an advanced subject. The
traditional way to deal with faults is to remedy the situation after a malfunction,
which is neither predictive nor effective. In this situation, operation and
maintenance work faces many difficulties. As a direct manifestation of device
status, the data that comes from a network element can reflect the state of the device. In
addition, it can help operators to analyze and make decisions more accurately.

Data structuring and data mining are carried out to explore abnormal data in
the network equipment, which can help users quickly identify possible failures
and achieve the goal of early warning. Nevertheless, analysts face great challenges
in an operator's network management environment because of the diversity and
complexity of the data that comes from network elements. It is also worth mentioning that
the method used to obtain structured log data has an objective impact on the result of anomaly
detection. Consequently, a comprehensive anomaly identification scheme for network
element log data is required for operation and maintenance work.

Taking into account the different needs of operator business analysis, we propose
diverse analytics solutions to match different business scenarios. In this paper, we
propose a time-series-based abnormity detection method for identifying network element
hitches. Meanwhile, we propose a solution to meet the needs of high-dimensional
analysis. In the experimental phase, we use actual fault data provided by the operator to
evaluate the precision of our algorithm. The experimental results indicate that our
methods can find abnormal states of network elements efficiently and accurately.


2 Related Work 


The current anomaly detection algorithms can be divided into four categories: statistical
methods, clustering-based methods, distance-based methods and density-based methods.
Among them, density-based anomaly detection is a hot topic in this field. Building on the
local-density anomaly detection algorithm [1] proposed by Breunig in 2000, subsequent
researchers gradually improved and developed the algorithm to make it suitable for
different scenarios. Some researchers attempted to use inexpensive local statistics to
reduce the sensitivity to the choice of parameter k [2]. In the paper of Lazarevic [3], the Local
Outlier Factor (LOF) is applied to multiple projections and the results are combined to
improve detection quality in high dimensions. [4, 5] present effective methods to
measure the similarity between objects, which can increase the stability and precision
of outlier detection. However, these algorithms have difficulty supporting large-scale data
processing because of the complexity of the network element's log data. Alternatively,
considering the time characteristics of log data, time series analysis is another
feasible way to detect anomalies. In the field of time series analysis, the moving average
model, linear regression model, polynomial regression model and exponential
smoothing are common algorithms. In statistics, a moving
average (MA) is a calculation that analyzes data points by creating a series of averages of
different subsets of the full data set [6]. The weighted moving average model (WMA) is
commonly used in the natural sciences and economics [7, 8]. Besides these,
linear regression was the first type of regression analysis to be studied rigorously and
to be used extensively in practical applications [9]. The polynomial regression model,
which was advanced by Gergonne [10, 11], has been applied extensively in many
fields. Exponential smoothing was first suggested by Robert Goodell Brown in
1956 [12], and one of the commonly used models is known as "Brown's simple
exponential smoothing" [13]. In the exponential smoothing model, the current value can be
predicted from past values, and the impact of recent values is stronger than that of remote ones.
Another practical model for analyzing time series is the autoregressive integrated moving
average model (ARIMA) [14, 15]. This model can predict the future value of a specified
time series on the basis of its past performance, but the basic ARIMA cannot retain the
seasonal characteristics of a time series.

All of the above models have their own advantages in time series analysis and are
effective tools for anomaly detection. In the analysis of the network element's
log data, both statistical methods and density-based methods can reasonably be
applied based on their properties.
3 Anomaly Detection Based on Network Element Log Data


3.1 Anomaly Detection Solution Design 


There are different concerns in the process of detecting network element hitches. To fit
the different demands of network management analysts, we propose a composite
anomaly detection method to identify network element hitches. The anomaly detection
procedure is shown in Fig. 1.


[Flowchart: Collect log data from network element -> Convert the raw data into structured
form -> Use hash algorithm to assign different ID for diverse modes -> Choose the target of
analysis. Branch "Detect specific anomalies": Create ARIMA model for specified mode -> Use
comprehensive time series model to detect specific anomalies. Branch "Detect global
anomalies": Reduce the dimension of structured log data by principal component analysis ->
Use LOF-based model to detect global anomalies. Both branches -> Output the result of
anomaly detection.]


Fig. 1. Anomaly detection procedure


If network management analysts would like to detect anomalies for a specific mode,
they can choose our time-series-based scheme. In the process of specific anomaly
detection, we combine qualitative analysis with quantitative analysis. On the one hand,
we roughly determine the presence of an abnormal period by creating an ARIMA model. On
the other hand, we use our comprehensive time series model to detect specific anomalies. If
network management analysts would like to include as many features of the network element
as possible in the analysis, they may choose our LOF-based scheme.
Considering the high-dimensional input, we use principal component analysis (PCA) to
reduce the dimension while retaining the characteristics of the raw data. In this scheme,
analysts can obtain global anomalies from the log data.


3.2 Qualitative Analysis Based on ARIMA Model 


In the autoregressive integrated moving average model [15], the data sequence formed by
the prediction object over time is regarded as a random sequence. Once the model is
identified, it can be used to predict future values from the past and present values of
the time series.

- The autoregressive integrated moving average model is defined as


ARIMA(p, d, q) (1)


where p denotes the number of autoregressive lag terms, q denotes the number of moving
average lag terms, and d denotes the order of differencing. In our method, we use the ARIMA
model to draw the curve of a specific mode, which helps us find suspected anomalies
qualitatively.


3.3 Quantitative Analysis Based on Comprehensive Time Series Model 


Let t denote the current time. The predicted value $P_t$ of our comprehensive time series
model is jointly determined by the continuous time smoothing factor $C_t$ and the
seasonal average factor $S_t$. The continuous time smoothing factor $C_t$ represents the
impact of recent trends on the current time, while the seasonal average factor $S_t$ takes
into account the impact of the same period.


- The continuous time smoothing factor $C_t$ is defined as

$$C_t = \beta y_t + (1 - \beta)\,C_{t-1} \qquad (2)$$

where $y_t$ is the measured value at time t and $\beta$ denotes the weight coefficient, which
measures the impact of recent trends on the current time.


- The seasonal average factor $S_t$ is defined as

$$S_t = \frac{\sum_{i=1}^{n} Y_i}{n} \qquad (3)$$

where $Y_i$ denotes the time units that are at the same period as $Y_t$, and n denotes the
number of time units we are concerned with.


- The predicted value $P_t$ of the comprehensive time series model is defined as

$$P_t = \max_{\alpha}\,[\alpha C_t + (1 - \alpha)\,S_t] \quad \text{s.t. } \alpha > 0. \qquad (4)$$

In this manner, we reduce the influence of any single factor on the predicted value
and increase the tolerance to extreme values. $P_t$ retains both the impact of recent
trends and the relevance of data in the same period, so it can predict
the value of the specified mode's time series at time t reasonably and effectively.


3.4 High Dimensionality Data Processing Scheme 


Considering the multi-dimensional characteristics of the NE, we introduce a LOF-based
algorithm to detect global outliers, which can encapsulate properties as node attributes.
In the actual network management scenario, the traditional LOF-based algorithm cannot
cope when the data volume is extremely large. In other words, because the
structured log data from network elements has a large variety of characteristics, the
traditional LOF algorithm meets a performance bottleneck in the computing process.
Therefore, the PCA method is used to reduce the dimensions of the input data while
preserving the feature information of the multi-dimensional data.

We use the Euclidean distance to measure the difference between the network element's
log data in different periods:

$$d(p, o) = \sqrt{\sum_{j}(p_j - o_j)^2} \qquad (5)$$





The abnormal degree of log data is measured by the local outlier factor (LOF) [1],
which is defined as

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|} \qquad (6)$$

where $N_k(p)$ denotes the k-distance neighborhood of p. In other words, $N_k(p)$ contains
every element whose distance from p is not greater than the k-distance. The k-distance
of p is defined as the distance $d(p, o)$ between p and an object o such that there are at
least k objects $m \in D$ with $d(p, m) \le d(p, o)$, and at most $k - 1$ objects
$m \in D$ with $d(p, m) < d(p, o)$.

The local reachability density of p is defined as

$$lrd_k(p) = \frac{|N_k(p)|}{\sum_{o \in N_k(p)} rd_k(p, o)} \qquad (7)$$

where the reachability distance of object p with respect to object o is defined as

$$rd_k(p, o) = \max\{k\text{-distance}(o),\ d(p, o)\} \qquad (8)$$


For a specified network element, the statistical distribution of log modes per hour
forms snapshots that reflect the status of the network element. According to the
definition of the local outlier factor (LOF) [1], the higher the value of the local outlier factor,
the more likely the object is an anomaly. Based on this, we abstract the hourly status of the
network element as objects (denoted as x(t)), whose abnormal degree can be
calculated by LOF. After obtaining the LOF value of every x(t), we can infer the
occurrence time of anomalies.
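
The definitions of the k-distance neighborhood, the reachability distance, the local reachability density and the LOF can be turned into a small brute-force sketch. It is illustrative only: it computes all pairwise Euclidean distances (quadratic cost), omits the PCA step, and assumes that no two points coincide, since duplicate points would make the reachability sum zero. The function name `lof_scores` is hypothetical.

```python
import math
from typing import List, Sequence


def lof_scores(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Local outlier factor of every point, following the definitions of Breunig et al."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []       # k-distance of each point
    neighbors = []    # k-distance neighborhood N_k(p), ties included
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_dist.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    # Local reachability density: |N_k(p)| / sum of reachability distances.
    lrd = []
    for i in range(n):
        reach = sum(max(k_dist[o], dist[i][o]) for o in neighbors[i])
        lrd.append(len(neighbors[i]) / reach)

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [sum(lrd[o] / lrd[i] for o in neighbors[i]) / len(neighbors[i])
            for i in range(n)]


if __name__ == "__main__":
    snapshots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
    for point, score in zip(snapshots, lof_scores(snapshots, k=2)):
        print(point, round(score, 3))
```

Points inside a uniform cluster receive a score close to 1, while an isolated point receives a much larger score, which is what a detection threshold on the LOF value exploits.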


3.5 Scheme Characteristic Analysis 


Our model is based on time series analysis, taking into account the combined
effect of continuous time trends and seasonal timing. In our model, we use an
optimization algorithm to optimize the parameters, which can avoid the impact of extreme
values. It is suitable for network management personnel who are familiar with specific
abnormal patterns of the NE's log data. For network management analysts, this
method is effective, accurate and target-oriented. That is to say, this model can
effectively find the anomalies of a specific log mode.

The global abnormity detection of network elements considers multiple
attributes of network logs, which retains the comprehensive information of the network
element's log data. However, the relevance and mutual influence between
attributes also affect the detection results. Therefore, it is suitable for network
management personnel who need a global perception of abnormity. That is to say, network
management personnel can flexibly choose to find targeted anomalies or global abnormity
according to different scenarios.


4 Experiments 


4.1 The Analysis of Algorithm 


To detect anomalies of a specific mode, we input the time series of that mode, where $y_t$
denotes the frequency of the mode per hour. Given the input time series, the
comprehensive time series analysis algorithm finds which time units are anomalies
(Tables 1 and 2).


Table 1. The comprehensive time series analysis algorithm


Input: the time series y_1, y_2, y_3, ..., y_t, ..., y_n of a specific mode

for each time unit y_t do
    compute the continuous time smoothing factor C_t
    compute the seasonal average factor S_t
    max ← -∞
    for each possible α do
        calculate αC_t + (1 - α)S_t with the parameter α
        if αC_t + (1 - α)S_t > max then
            max ← αC_t + (1 - α)S_t
    P_t ← max
for each P_t do
    if |P_t - y_t| is greater than threshold then
        put y_t into output list
Return a list of anomalies of the specific mode
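
The steps of Table 1 can be sketched in Python as follows. Several details are assumptions made only for this illustration, since the text does not fix them: the smoothing recursion starts from the first observation, the seasonal average uses up to `n_seasons` earlier values at the same phase of a cycle of length `period`, the candidate values of the weight alpha come from a small fixed grid, and time units without any seasonal history are skipped. The function name and the default values are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def comprehensive_anomalies(y: Sequence[float], period: int, beta: float = 0.5,
                            alphas: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
                            n_seasons: int = 3, threshold: float = 10.0) -> List[int]:
    """Return the indices t whose measured value deviates from the prediction P_t."""
    anomalies = []
    c = y[0]  # assumed starting value of the smoothing factor
    for t, y_t in enumerate(y):
        # Continuous time smoothing factor: C_t = beta * y_t + (1 - beta) * C_{t-1}
        c = beta * y_t + (1.0 - beta) * c
        # Seasonal average factor: mean of earlier values at the same phase
        same_phase = [y[t - j * period] for j in range(1, n_seasons + 1)
                      if t - j * period >= 0]
        if not same_phase:
            continue  # no seasonal history yet
        s = mean(same_phase)
        # Predicted value: largest weighted combination over the alpha grid
        p = max(a * c + (1.0 - a) * s for a in alphas)
        if abs(p - y_t) > threshold:
            anomalies.append(t)
    return anomalies


if __name__ == "__main__":
    series = [10.0] * 20
    series[14] = 100.0  # injected spike
    print(comprehensive_anomalies(series, period=4))
```

Because both factors carry memory of the spike, a few time units after it can also be flagged; in practice the threshold and the alpha grid would be tuned on labelled fault data.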


Considering the high-dimensional input, we use the LOF-based model to find global
anomalies.


Table 2. LOF-based anomaly detection algorithm


Input: the set of statuses of the network element per hour x(1), x(2), x(3), ..., x(t), ..., x(n),
and the detection threshold δ

reduce dimensions by principal component analysis
for each x(t) do
    compute the local outlier factor (LOF) of x(t), denoted as Lof(t)
for each Lof(t) do
    if Lof(t) > δ then
        put x(t) into output list
Return a list of global anomalies


4.2 Log Data Structure 


Word2Vec is used to distinguish between logging modes and to obtain
structured log data. We use an SQL aggregation statement to aggregate the data and obtain the
frequency of modes per hour. The result of the aggregation operation is shown in Table 3.

As can be seen in Table 3, all log modes are coded by a hash algorithm, and the last
column indicates the frequency of modes per hour. In order to satisfy our analysis
conditions, we convert the data to the following format.

In Table 4, the frequencies of diverse log modes are abstracted as attributes of the log
status of the Network Element. This form helps network management staff locate a
specific log mode. Moreover, it can be processed by dimensionality reduction algorithms
when the dimension of attributes is extremely high.
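
The conversion from the long format (mode hash, hour, frequency) to the wide format (one row per hour, one column per mode) can be sketched as below. The mode identifiers and counts in the example are made up for illustration, and the function name `pivot_counts` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pivot_counts(rows: Iterable[Tuple[str, str, int]]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn (mode, hour, count) rows into one count vector per hour.

    Returns the sorted list of modes (the column order) and a mapping from
    hour to the counts of every mode in that hour, with 0 for absent modes.
    """
    rows = list(rows)
    modes = sorted({mode for mode, _, _ in rows})
    by_hour: Dict[str, Dict[str, int]] = defaultdict(dict)
    for mode, hour, count in rows:
        by_hour[hour][mode] = by_hour[hour].get(mode, 0) + count
    table = {hour: [counts.get(mode, 0) for mode in modes]
             for hour, counts in sorted(by_hour.items())}
    return modes, table


if __name__ == "__main__":
    sample = [
        ("aa", "2016-08-16 00", 3),  # hypothetical mode IDs and frequencies
        ("bb", "2016-08-16 00", 1),
        ("aa", "2016-08-16 01", 2),
    ]
    columns, wide = pivot_counts(sample)
    print(columns)
    for hour, vector in wide.items():
        print(hour, vector)
```

Each resulting row is the attribute vector x(t) of one hour, which is the input expected by a dimensionality reduction step or by a LOF computation.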


4.3 ARIMA Analysis on Operator Actual Data 


In this part, we use an ARIMA-based model to qualitatively analyze the time series of a
specific log mode. To facilitate the description, we select one of the specific log
modes that network management analysts are interested in. The analysis of
other log modes is the same (Table 5).

First, we select a specific log mode as input for the ARIMA model. Second, we
generate the curves of the Autocorrelation Function (ACF) and Partial
Autocorrelation Function (PACF) of the log mode's time series.

In Fig. 2, the Autocorrelation Function (ACF) decays gradually and the Partial
Autocorrelation Function (PACF) cuts off after lag 1, so this specific mode matches
ARIMA(1,0,0). Therefore, we create an ARIMA(1,0,0) model and
draw the curves of the raw data and the data predicted by ARIMA.

As can be seen in Fig. 3, the ARIMA(1,0,0) model captures the trend of the
time series of the specific log mode well. Moreover, there are large
differences between the raw data and the predicted data at some points. We can therefore
conclude that there are suspected anomalies on 2016-8-17 and 2016-8-19.
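
An ARIMA(1,0,0) model is a first-order autoregression around a constant mean, so the qualitative step can be imitated without a statistics library. The sketch below is a simplification and not the authors' implementation: it estimates the AR coefficient by the lag-1 sample autocorrelation (the Yule-Walker estimate) rather than by maximum likelihood, and the example series is invented.

```python
from typing import List, Sequence, Tuple


def acf(x: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    var = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / var
            for k in range(max_lag + 1)]


def ar1_residuals(x: Sequence[float]) -> List[Tuple[int, float]]:
    """One-step-ahead residuals (t, x_t - prediction) of a fitted AR(1) model."""
    mean = sum(x) / len(x)
    phi = acf(x, 1)[1]  # Yule-Walker estimate of the AR(1) coefficient
    return [(t, x[t] - (mean + phi * (x[t - 1] - mean)))
            for t in range(1, len(x))]


if __name__ == "__main__":
    counts = [10, 12, 11, 13, 12, 50, 12, 11, 13, 12, 11, 12]  # invented hourly counts
    print([round(r, 3) for r in acf(counts, 3)])
    t_max, residual = max(ar1_residuals(counts), key=lambda pair: abs(pair[1]))
    print("largest deviation at index", t_max, "residual", round(residual, 2))
```

Time units with unusually large residuals play the role of the visible gaps between the raw curve and the predicted curve, and mark the periods to examine quantitatively.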
Table 3. Input data structure


Log mode (hash) | Hour | Frequency
8f6da0d89a8252e01944d8863786d52b | 2016-08-16 00 | 1103
a63 la85c804c2bdc2afc80c596b0e005 | 2016-08-16 00 | 724
661770dbcd0394f8db00ddd74352c007 | 2016-08-16 00 | 723
78e3e5d53db84e87a09a3996b09fdbb2 | 2016-08-16 00 | 719
£24f62eeb789 199b9b2e467df3b1876b | 2016-08-16 00 | 410
fd7997adb870d78fa830348b5514fcOd | 2016-08-16 00 | 238
4586e5dlace7df6ac23ccel0ee8dfdb4 | 2016-08-16 00 | 143
ff4b2749ab7483d8f9a7a89d70e08c43 | 2016-08-16 00 | 143
478458696bcf7c728 12ee5e623dedfda | 2016-08-16 00 | 132
13e208b8db3b0c2f9d956c02253cablaa | 2016-08-16 00 | 132
£4094a94a79al21bcd8al0960fe34a61 | 2016-08-16 00 | 126
8424707d0420cadee4c97ef4af2ef9f4 | 2016-08-16 00 | 112
£20e457d2828982ee27a640362891d63 | 2016-08-16 00 | 89
C3f091f048a28894fea7d50660dd6942 | 2016-08-16 00 | 89
©0085 18e807af2bec2356cf7ecd57b22 | 2016-08-16 00 | 89
004f4855ba92c878 1a6b8c2014fd9c2 | 2016-08-16 00 | 60
75849e828d2b48945d6a6453485b59ac | 2016-08-16 00 | 15
731c019558bf9ed8ff96dc892e9b5325 | 2016-08-16 00 | 15
9d2561 15355016cf657aa064a9f04862 | 2016-08-16 00 | 12
6dbe23d02c718ad2118aef94479d132b | 2016-08-16 00 | 11
1dd5ddd339089a28a3blc41dabfa87eb | 2016-08-16 00 | 11
1b78efaa73d320280808e0b361b206bd | 2016-08-16 00 | 11
13b97cb2bc33bd94b82e2c6b9c637725 | 2016-08-16 00 | 8
609c79b375b6a32fa86a56f4136e5734 | 2016-08-16 00 | 8
al5aa8107c65 1lce08ce8375746717ed1 | 2016-08-16 00 | 8













































































Table 4. Input data transformation


[Table body not recoverable from the extracted text. Each row is an hourly timestamp
(2016-08-16 00 through 2016-08-16 23), each column is a log mode identified by its hash
prefix, and each cell holds the count of that log mode in that hour.]
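The step from the long layout of Table 3 (one row per log mode and hour) to the wide layout of Table 4 (one row per hour, one column per log mode) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pivot_counts`, the shortened hash prefixes, and the third sample row are assumptions made for the example.

```python
from collections import defaultdict


def pivot_counts(records):
    """Turn (mode, hour, count) rows into {hour: {mode: count}}.

    Modes that do not occur in a given hour get a count of 0, which is
    what makes every hour a fixed-length attribute vector.
    """
    modes = sorted({mode for mode, _, _ in records})
    table = defaultdict(lambda: dict.fromkeys(modes, 0))
    for mode, hour, count in records:
        table[hour][mode] += count
    return modes, dict(table)


# First two rows follow Table 3 (hashes shortened); the third is hypothetical.
records = [
    ("8f6da0d8", "2016-08-16 00", 1103),
    ("661770db", "2016-08-16 00", 723),
    ("8f6da0d8", "2016-08-16 01", 950),
]
modes, wide = pivot_counts(records)
```

Each value of `wide` can then be read as the attribute vector of one hour, with the columns ordered as in `modes`.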




Table 5. The specified mode 





Fig. 2. Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of
specific log mode
Fig. 3. Timing distribution of anomaly detection


4.4 Comprehensive Time Series Analysis on Operator’s Actual Data


Having identified the abnormal period, we then use our comprehensive time series
analysis method to locate the anomalies quantitatively. In this part we compare our
algorithm with other common time-series-based algorithms. We choose log data from
network elements that have experienced hitches in the past year. For SAEGW1304, the
specified time range is 2016-08-16 00:00 to 2016-08-27 23:00. For another network
element, SZHSAEGW105BEr, the time range is 2016-09-02 00:00 to 2016-10-11 23:00.

Experimental results show that our Comprehensive Time Series method improves the
precision ratio for finding network element anomalies and greatly improves the recall ratio.
Fig. 4. Algorithm efficiency evaluation for SAEGW1304 and SZHSAEGW105BEr


4.5 High Dimensionality Data Processing Scheme to Find Global Abnormity


In the LOF-based anomaly detection process, the frequencies of the diverse log modes
are treated as attributes of the log status of a network element. We use LOF (local
outlier factor) to measure the degree to which each record is an outlier and thereby
find the global anomalies in the log data.

In Fig. 5, 2016-08-16 23:00 and 2016-08-17 00:00 appear as the global anomalies of the
log data, and they are associated with the hitch of the network element (SAEGW1304).
This method considers multiple attributes of the network logs and therefore retains
the comprehensive information in the network element’s log data. However, correlation
and mutual influence between attributes also affect the detection results.
Fig. 5. High dimensionality anomaly detection
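
The LOF scoring described in this section can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the implementation used in the paper: the function name `lof_scores`, the choice k = 3, the Euclidean distance, and the sample vectors are all hypothetical, and the sketch assumes no duplicate points (duplicates would make a reachability sum zero).

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor of every point (brute force, Euclidean distance)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbours = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]          # distance to the k-th nearest neighbour
        k_dist.append(kd)
        neighbours.append([j for j in order if dist[i][j] <= kd])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical hourly attribute vectors (counts of two log modes per hour).
hours = [(100, 20), (102, 19), (98, 21), (101, 22), (99, 18), (400, 150)]
scores = lof_scores(hours, k=3)
```

Scores close to 1 indicate hours whose density matches their neighbours; the last vector, which lies far from the rest, receives a much larger score and would be reported as a global anomaly.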


4.6 Actual Case Verification 


To evaluate the accuracy of our detection method, we select a real case provided by
the operator. Applying our detection method to the log data from the network element
SAEGW1304 lets us verify its effectiveness.

As can be seen in Tables 6 and 7, the Comprehensive Time Series method detects the
anomalies with a 100% recall ratio. Moreover, Fig. 4 shows that its precision ratio is
higher than that of other common algorithms. These results verify the effectiveness
of our method in detecting the abnormal status from SAEGW1304’s log data.
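
For reference, the precision and recall ratios used in this comparison can be computed as below. This is a generic sketch; the function name `precision_recall` and the hour labels are hypothetical and are not taken from the operator's data.

```python
def precision_recall(detected, actual):
    """Precision and recall of a set of detected hours against the true anomalous hours."""
    detected, actual = set(detected), set(actual)
    true_pos = len(detected & actual)
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labels: four hours flagged, two of them truly anomalous.
precision, recall = precision_recall({"h1", "h2", "h3", "h4"}, {"h2", "h3"})
```

A 100% recall ratio means every truly anomalous hour was flagged; precision then measures how many of the flagged hours were real anomalies.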


Table 6. The record of hitches of SAEGW1304


Event description | Associated log time | NE label | Data description
A hitch occurred in Power Supply Bureau’s network environment; Guangzhou’s network element SAEGW1304 | August 16, 2016 22:00 to August 17, 2016 14:00 | SAEGW1304 | The data to be analyzed comes from SAEGW1304’s log data
Table 7. Comprehensive Time Series method detection result


Log mode (hash) | Hour
46dc1d901e25a7ff0d19c9354e76d7a33 | 2016-08-17 04
46dc1d901e25a7ff0d19c9354e76d7a34 | 2016-08-17 05
46dc1d901e25a7ff0d19c9354e76d7a35 | 2016-08-17 06
46dc1d901e25a7ff0d19c9354e76d7a36 | 2016-08-17 07
46dc1d901e25a7ff0d19c9354e76d7a37 | 2016-08-17 08
46dc1d901e25a7ff0d19c9354e76d7a38 | 2016-08-17 09
46dc1d901e25a7ff0d19c9354e76d7a39 | 2016-08-17 10
46dc1d901e25a7ff0d19c9354e76d7a40 | 2016-08-17 11
46dc1d901e25a7ff0d19c9354e76d7a41 | 2016-08-17 12
46dc1d901e25a7ff0d19c9354e76d7a42 | 2016-08-17 13
46dc1d901e25a7ff0d19c9354e76d7a43 | 2016-08-17 14
46dc1d901e25a7ff0d19c9354e76d7a44 | 2016-08-17 15
46dc1d901e25a7ff0d19c9354e76d7a45 | 2016-08-17 16
46dc1d901e25a7ff0d19c9354e76d7a46 | 2016-08-17 17
46dc1d901e25a7ff0d19c9354e76d7a47 | 2016-08-17 18


5 Conclusion 


Against the background of efficient system operation and maintenance, we propose diverse
analytics solutions to match different business scenarios. In our method, the frequency
characteristics of log data are abstracted as attributes of network elements.
Applying data mining algorithms to this data, we detect specific anomalies and
global anomalies to meet the different demands of network management analysts. The
results of our analysis method are verified on actual data and are useful for
operators’ fault early-warning systems. More interesting results can be expected
along these lines, since the log data of network elements contains abundant
information.
Abstract. Traditional fingerprint positioning algorithms often use k-means
clustering. However, because fingerprints fluctuate over time under the influence of
objective factors, k-means cannot adapt to changes in the fingerprints and cannot
determine the number of clusters adaptively, so the matching accuracy is low. This
paper adopts a clustering algorithm based on the support vector machine (SVM) and
DBSCAN. It generates an optimal-hyperplane fingerprint model that continuously adapts
to change, which addresses the poor matching results caused by fingerprint
fluctuation, and it determines the number of clusters automatically during matching.
A matching probability model is selected based on the statistical density
characteristics of DBSCAN, which improves the matching accuracy and reduces the
matching time. The positioning accuracy reaches 2.04 m within the 57% range, compared
with 6.1 m for k-means, an improvement of 52.3%.


Keywords: Location fingerprint - Clustering algorithm - SVM


1 Introduction 


With the rapid development of the mobile Internet and mobile terminal equipment, indoor
positioning has become a frontier of information technology research [1]. Supported by
the world’s four major satellite navigation systems, outdoor location services have
become a widespread part of people’s lives. However, people spend about 80% of their
daily time indoors, and with the increasing number of large buildings, indoor location
services have broad prospects in commercial applications, public safety, and other
areas. In indoor environments, satellite signals cannot be used because of building
occlusion and multipath [1]. At present, location fingerprinting based on received
signal strength (RSSI) is widely used in indoor positioning. Fingerprint localization
exploits the multipath and non-line-of-sight signal features caused by complex indoor
architecture, and improves the positioning accuracy where traditional positioning
methods struggle.

Indoor positioning methods include angle of arrival (AOA) positioning, time of arrival
(TOA) positioning, time difference of arrival (TDOA) positioning, hybrid TDOA/AOA
positioning, and positioning based on the received signal strength indicator (RSSI).
TOA and TDOA positioning require high-precision hardware synchronization. AOA
positioning requires directional antennas and is seriously affected by multipath in
NLOS environments. WiFi positioning widely uses RSSI, generally through fingerprint
matching. Although the indoor environment is complex, its layout remains largely
unchanged, so the characteristics of the wireless signal at a specific position (the
number of signals, phase, and intensity) are highly distinctive. They serve as a unique
“fingerprint” that identifies the location, and the position is then calculated by a
fingerprint matching algorithm.

The accuracy of fingerprint matching can generally be improved in three ways: improving
the stability of the RSSI fingerprint database, optimizing the structure of the
fingerprint database, and improving the clustering algorithm. This paper starts from
the clustering algorithm. The traditional K-means algorithm is based on the concept of
data partitioning. It cannot adaptively optimize the matching model, and it ignores the
fact that the data model should keep changing as the test vectors change. Positioning
accuracy is therefore easily affected by time, weather, and other objective factors,
and is difficult to improve further.


2 Matching Positioning Method for Position Fingerprint 


This method is generally divided into an off-line training stage and an online matching
stage. In the off-line training stage, a mobile device collects the RSSI sequence at
each fingerprint acquisition point. The signal strengths from the APs, together with
the location information, form an RSSI signal vector that is stored in the fingerprint
database. The acquisition points are laid out at a certain density, so that the stored
vectors form a vector space over the positioning area (Fig. 1).





Fig. 1. Single AP signal intensity distribution of National Grand Theater 


3 Several Data Clustering Methods 


When the amount of fingerprint data is large, clustering algorithms are widely used to
optimize the structure of the fingerprint database. Common approaches are partitioning
methods based on the data mean, density-based methods, and machine learning methods
based on a data model. This paper uses a divisive hierarchical approach: the entire
data set is first treated as one cluster and is then split into multiple clusters by
adaptive grid partitioning, after which a data model is built for the data set. Two
different clustering algorithms are fused so that the advantages of both are exploited,
obtaining the most useful location information from the smallest data space and
thereby improving the positioning accuracy.


3.1 K-means Clustering Based on Data Mean 


For the K nearest neighbor method, the K nearest neighbor points are given different
weights according to certain rules:

$$(x, y) = \frac{1}{K}\sum_{i=1}^{K} w_i\,(x_i, y_i) \tag{1}$$

(1) The K neighbor points need different weights to improve the positioning accuracy of
the system. We borrow from the classic signal positioning algorithm based on an
active RFID system. That system places reference tags at selected reference points
in the positioning area, compares the signal strength of the tag to be located with
that of the reference tags, finds the several reference tags closest to the tag to
be located, and then estimates its coordinates as a weighted combination of the
coordinates of those reference tags. The weights here are redesigned following the
weight design of that system.

(2) The weight depends on the square of the Euclidean distance D: the smaller D is, the
greater the weight. Each weight depends directly on the current Euclidean distance
D_i, which strengthens the influence of D_i on the weighting (Fig. 2).
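
The weighting idea in items (1) and (2) can be sketched as a weighted K-nearest-neighbor estimate. This is a minimal sketch under stated assumptions rather than the paper's exact rule: the weights are taken as 1/D_i^2 and normalized to sum to one, and the function name `wknn_locate`, the small constant `eps`, and the sample fingerprints are hypothetical.

```python
import math


def wknn_locate(fingerprints, rssi, k=3, eps=1e-9):
    """Estimate (x, y) from the k fingerprints closest to the measured RSSI vector.

    fingerprints: list of ((x, y), rssi_vector) pairs from the off-line stage.
    Weights are the inverse squared Euclidean distances in signal space.
    """
    ranked = sorted(fingerprints, key=lambda fp: math.dist(fp[1], rssi))[:k]
    weights = [1.0 / (math.dist(vec, rssi) ** 2 + eps) for _, vec in ranked]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, ranked)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, ranked)) / total
    return x, y


# Hypothetical one-AP database: two reference points 10 m apart.
database = [((0.0, 0.0), (-50.0,)), ((10.0, 0.0), (-60.0,))]
estimate = wknn_locate(database, (-55.0,), k=2)
```

With a measurement exactly halfway between the two stored signal strengths, both weights are equal and the estimate falls at the midpoint of the two reference points.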
Fig. 2. National Grand Theater K-means clustering results


3.2 Density Based DBSCAN Algorithm


(1) (Eps-neighborhood) Given a data object p, the Eps-neighborhood of p is defined as
the D-dimensional hypersphere region centered at p with radius Eps, that is,

$$N_{Eps}(p) = \{\, q \in D \mid \mathrm{dist}(p, q) \le Eps \,\} \tag{2}$$

DBSCAN groups density-connected data points into a cluster. The density of a point is
m/(nV), where m is the number of data points, out of n input points, that fall in a
small region around the point, and V is the volume of that region. The region is taken
to be a hypersphere of radius Eps, so the density threshold can be determined by
specifying the single parameter MinPts.

For given Eps and MinPts values, DBSCAN finds a dense point in the input set and
expands it by merging adjacent dense regions. A point is either a dense (core) point or
a non-dense (non-core) point. A non-dense point is either a boundary point of a dense
region or noise.

Fingerprints are clustered by density over the set of signal space vectors in the
plane. With a density-based clustering algorithm the number of clusters does not need
to be set in advance, so the method suits data sets whose cluster distribution is
unknown. DBSCAN is a widely used density-based algorithm. It measures regional density
quickly by counting the data samples inside a hypersphere region. DBSCAN can discover
clusters of arbitrary shape and effectively identifies singular points. After DBSCAN
discovers the clusters and the characteristics of the gridded data, each cluster is
marked with a probability according to its data density. The clusters are then ordered
by probability, and the cluster with the maximum probability is matched first, which
improves the matching accuracy (Fig. 3).
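
The neighborhood definition and the expansion of dense points can be sketched as a small brute-force DBSCAN. This is an illustrative sketch of the standard algorithm, not the authors' implementation; the function name `dbscan`, the parameter values, and the sample points are assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise (singular points)."""
    labels = [None] * len(points)

    def region(i):
        # Eps-neighborhood of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1               # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(j)
            if len(more) >= min_pts:     # j is itself a core point: keep expanding
                queue.extend(more)
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

The number of clusters is not supplied anywhere; it follows from `eps` and `min_pts`, which is the property the text relies on.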





Fig. 3. Grid partition result 




3.3 SVM (Support Vector Machine) Clustering Based on Data Model 


After DBSCAN divides the grid, an SVM is used to build the data model and carry out the
second level of clustering. Consider first a linear SVM. Suppose the training set of
size N consists of two classes; a sample $x_i$ is labeled $y_i = 1$ if it belongs to
the first class and $y_i = -1$ if it belongs to the second. If a hyperplane g(x)
classifies every training vector correctly, the training sample set is linearly
separable:

$$g(x) = \omega^{T} x + b = 0; \qquad \omega^{T} x_i + b \ge 1 \ (y_i = 1); \qquad \omega^{T} x_i + b \le -1 \ (y_i = -1). \tag{4}$$

When the training set is not linearly separable, the input feature space is mapped by a
nonlinear function $\Phi$ to a higher-dimensional space, $x \in R^{n} \to \Phi(x) \in R^{m}$,
where the classes can be separated by a hyperplane, which gives the discriminant
function.

$$\omega^{T}\Phi(x) + b = 0 \tag{5}$$

Discriminant function:

$$f(x) = \mathrm{sign}[\omega^{T}\Phi(x) + b], \qquad y_i(\omega^{T}\Phi(x_i) + b) \ge 1 - \xi_i \tag{6}$$

In the formulation, C is the penalty parameter. The objective function should make the
margin of the hyperplane as large as possible while keeping the deviation of the data
points as small as possible, and the parameter C controls the weight between these two
goals.


4 DBSCAN-SVM Clustering Algorithm 


4.1 Research Method 


The method combines DBSCAN with a support vector machine for indoor positioning. It
exploits the ability of DBSCAN to find the largest clusters quickly and the ability of
the support vector machine to solve for a hyperplane model, in order to achieve the
desired clustering effect. DBSCAN is applied to the original data points to select the
density clusters and determine their number, replacing the traditional local grid
division. This quickly eliminates singular points and sparse regions, which optimizes
the positioning accuracy and the matching speed.


4.2 Implementation Methods 


Density-based DBSCAN is applied to the original data points to select the clusters of
various densities. After the clusters of various shapes are found, the singular points
are excluded and the clusters are sorted by probability density. In the matching
process, the cluster with higher probability is preferred (Fig. 4).
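
The ordering of clusters by probability can be sketched as follows. The paper does not spell out how the probability is computed, so this sketch assumes it is each cluster's share of the non-noise points; the function name `rank_clusters` and the label convention (-1 for singular points) are likewise assumptions.

```python
from collections import Counter


def rank_clusters(labels, noise=-1):
    """Return (cluster, probability) pairs sorted from most to least probable.

    The probability of a cluster is taken to be its share of all
    non-noise points (an assumption for this sketch).
    """
    counts = Counter(label for label in labels if label != noise)
    total = sum(counts.values())
    ranked = [(cluster, size / total) for cluster, size in counts.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical DBSCAN output: two clusters and one singular point.
ranking = rank_clusters([0, 0, 0, 0, 1, 1, 1, -1])
```

During matching, the clusters would then be tried in the order given by `ranking`.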
Fig. 4. DBSCAN first layer clustering results (estimated number of clusters: 8)


4.2.1 Cluster Combination 
The clusters obtained are combined in pairs according to their numbers. An SVM
classifier is then trained for each pair of clusters, so the total number of
classification models is N(N - 1)/2.
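
The pairwise combination can be enumerated directly, which also confirms the N(N - 1)/2 count. A minimal sketch; the function name `one_vs_one_pairs` is hypothetical, and the value N = 8 is taken from the cluster count reported with Fig. 4.

```python
from itertools import combinations


def one_vs_one_pairs(cluster_ids):
    """All unordered pairs of clusters; one binary SVM would be trained per pair."""
    return list(combinations(sorted(cluster_ids), 2))


pairs = one_vs_one_pairs(range(8))   # 8 clusters, as estimated in Fig. 4
```

For 8 clusters this gives 8 * 7 / 2 = 28 pairs, so 28 hyperplane models would be trained.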
To find the best classification hyperplane, the next step is to solve a constrained
extreme value problem. The SVM optimal classification surface problem can be described
as:

$$\min \frac{1}{2}\|\omega\|^{2} \quad \text{s.t.}\ y_i(\omega^{T} x_i + b) \ge 1, \quad i = 1, 2, \ldots, n \tag{7}$$

By means of formula (6), we can solve for an optimal hyperplane that separates the data
of the two classes according to the mathematical model. The N clusters obtained earlier
are fed pairwise into SVM classifier training, which yields N(N - 1)/2 hyperplane
models (Fig. 5).


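$$\min \Phi(\omega) = \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\ y_i(\omega^{T}\Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1, 2, \ldots, N \tag{8}$$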
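
To make the role of C in the soft-margin problem (8) concrete, the objective can be evaluated for a given hyperplane. This is a sketch under stated assumptions: the mapping is taken to be the identity (a linear kernel), each slack is set to its smallest feasible value max(0, 1 - y_i(w.x_i + b)), and the function name and sample data are hypothetical.

```python
def soft_margin_objective(w, b, xs, ys, c):
    """Return (objective, slacks) for 0.5*||w||^2 + C*sum(xi_i) with a linear kernel."""

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # Smallest slack that satisfies y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0.
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]
    return 0.5 * dot(w, w) + c * sum(slacks), slacks


# Hypothetical data: two points outside the margin, one inside it.
value, slacks = soft_margin_objective(
    (1.0, 0.0), 0.0, [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)], [1, -1, 1], c=1.0
)
```

Raising C makes the margin violation of the third point more expensive relative to the margin term, which is the trade-off the parameter controls.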





Fig. 5. Fingerprint database statistical positioning results and optimal classification
plane sketch map

The penalty term can be described as a correction factor applied to the distance
between the hyperplane model and the measured points, introduced to avoid overfitting.


4.2.2 Location Process 

The signal strength received at the point to be located is fed into the established
models, and the probability density method is used to select the model with the
maximum probability density. As a first step, the weights are calculated from the
probabilities of the K regions obtained by classification (Fig. 6).

Fig. 6. Error cumulative distribution function


5 Conclusion 


This paper introduces a fusion clustering algorithm for indoor positioning
applications that addresses the volatility of RSSI and finds clusters quickly. Tests
show that it effectively increases the matching accuracy, reduces the matching time,
and improves the positioning accuracy.
Abstract. Aircraft have been widely used in recent years with the rapid development
of economies. In consideration of aircraft security and the importance of navigation
information, numerous sensors have been utilized for aircraft tracking. Deviations are
easily produced by sensors and have a significant influence on information accuracy.
The existing literature on methods to monitor sensors and evaluate their deviations is
limited. In this paper, the information detected by sensors is compared to the
reference trajectory. Subsequently, a novel method to calculate deviations and errors
is proposed. To validate the reliability of the evaluation results, the Chi Square test
is integrated into the research method. A statistical method is also proposed to
identify the existence of deviations between two sensors.


Keywords: Sensors - Deviations - Chi Square test - Test statistics 


1 Introduction 


Aircraft are widely used across economies, and flight safety has gained increased
importance and attention. Many types of sensors have been deployed to acquire and 
monitor location information in different domains, such as the military, medical care, 
and environmental management. Multiple sensors are generally used to identify 
geographical coordinates precisely. However, various environmental factors can cause 
sensor deviation, which results in inaccurate data collection. These factors should be 
considered when deploying sensor probes. Thus, the calculation and the analysis of 
sensor deviations are crucial issues. 

In most recent studies, sensors are considered as suitable tools to investigate aircraft 
information. In [1], the airspeed sensors were used to reconstruct the approach-path 
deviations through triangulation and the flow sensors controlled the touchdown phase. 
The authors in [2] demonstrated a trajectory recommendation framework that relied on 
various sensors. The authors of [3] reported that it was difficult to predict the
precise trajectories and locations of thrown objects. A sensor system, which received data from distance
sensors, was responsible for detecting the deviations between the actual and predicted 
values. Then an algorithm to measure the in-between points during flights was proposed. 
The authors of [4] used inertial sensors to gather optical flow information and presented 
a set of novel navigation techniques to estimate the ground velocity and altitude of
aircraft. Two formulations were proposed to calculate aircraft velocity and altitude. A
novel method that used a ToF [5] camera to detect an aerial target was proposed in [6].
On the basis of a special requirement, the sky was considered as noise in the image and 
the target could be recognized from a model of complete randomness, thus illustrating 
the robustness of sensor detection with respect to the number of false alarms. In addition, 
the authors conducted a study to analyze the detection performance. In [7], the accuracy 
of sensors was noted. The authors proposed a linear covariance technique to predict the 
accuracy of altitude determination systems for flights. The authors also carried out a 
study to analyse the performance of the detection. In [8], a radar sensor was placed onto
a UAV, which calculated the position deviations of aircraft from a theoretical trajectory.
A SAR processor and navigation system determined the difference between theoretical 
and actual values. 

To the best of our knowledge, the literature on the evaluation of sensor deviations 
is limited. In the present research, we build the reference trajectory and propose a novel 
method to calculate and evaluate the deviations for each sensor. The feasibility of the 
obtained data is evaluated through residual analysis using the Chi Square test. We also 
propose a statistical method to determine whether the deviations exist between two 
sensors or not. 

The rest of this paper is organized as follows. Section 1 discusses the potential 
application of sensors in the aviation industry and the importance of evaluation methods. 
Section 2 presents the proposed calculation and evaluation methods. Section 3 presents 
the statistical method for detecting the deviations for each sensor. Finally, the study is 
concluded in Sect. 4. 


2 Deviation Calculation and Evaluation 


To obtain the deviation, we should build a reference trajectory first. The reference data 
need to be transformed according to the local sensor framework. If the total number of 
points tracked by two sensors is larger than 50, then high-accuracy sensors are used to 
plan the trajectory. The reference time is calculated as follows: 


$$Time_{ref} = \frac{\mathrm{Min}(time_i) + \mathrm{Max}(time_i)}{2} \tag{1}$$


The reference trajectory of a sensor is built through least-squares fitting of a
polynomial. The values in the deviation set are summed, and the sensor deviation is
equal to their average. In order to calculate the deviation between two sensors, the
deviation of each sensor relative to the reference trajectory should be obtained first.
The tracks of sensor $S_x$ are recorded in the set $t_x$. The deviation of sensor $S_x$
can be defined as follows:

$$b_{s_x|r} = \bar{\varepsilon}_x = \frac{\sum \varepsilon_x}{N_x} \tag{2}$$

where $b_{s_x|r}$ and $\bar{\varepsilon}_x$ are the deviation between $S_x$ and the
reference trajectory. The sum of values in the set $\varepsilon_x$ is expressed as
$\sum \varepsilon_x$, and the variable $N_x$ is the number of points tracked by sensor
$S_x$. The deviation between sensors $s_1$ and $s_2$ is defined as follows:

$$b_{s_1|s_2} = b_{s_1|r} - b_{s_2|r} \tag{3}$$

The errors induced by the deviations can be obtained at the same time by calculating
the standard deviation of a sensor as follows:

$$\sigma_x = \sqrt{D(\varepsilon_x)} \tag{4}$$

where $D(\cdot)$ denotes the variance. The variable $\bar{\sigma}_x$ denotes the average
value of the standard deviations:

$$\bar{\sigma}_x = \frac{\sigma_x}{\sqrt{N_x}} \tag{5}$$

The standard deviation for each sensor can be obtained easily. After acquiring the
standard deviations for sensors $s_1$ and $s_2$, the errors induced by the deviations
between these two sensors can be calculated as follows:

$$\sigma^{2}_{evaluate} = \bar{\sigma}^{2}_{s_1} + \bar{\sigma}^{2}_{s_2} \tag{6}$$
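
The chain from residuals to the error between two sensors can be sketched as follows. The source equations are partly illegible, so this sketch rests on assumptions: the standard deviation is taken as the population standard deviation of the residuals, the averaged value as the standard error of the mean, and the function names and sample values are hypothetical.

```python
import math
import statistics


def sensor_deviation(measured, reference):
    """Return (mean deviation, standard deviation, standard error) of one sensor."""
    residuals = [m - r for m, r in zip(measured, reference)]
    bias = statistics.fmean(residuals)
    sigma = statistics.pstdev(residuals)          # assumption: population form
    std_err = sigma / math.sqrt(len(residuals))
    return bias, sigma, std_err


def deviation_between(dev1, dev2):
    """Deviation between two sensors and the combined error of that deviation."""
    bias = dev1[0] - dev2[0]
    sigma_eval = math.sqrt(dev1[2] ** 2 + dev2[2] ** 2)
    return bias, sigma_eval


# Hypothetical tracks measured against a flat reference trajectory.
reference = [0.0, 0.0, 0.0, 0.0]
dev_a = sensor_deviation([1.0, 3.0, 1.0, 3.0], reference)
dev_b = sensor_deviation([0.0, 2.0, 0.0, 2.0], reference)
between = deviation_between(dev_a, dev_b)
```

Here sensor A sits on average 2 units above the reference and sensor B 1 unit above it, so the deviation between them is 1 unit, with a combined error obtained by adding the two squared standard errors.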


Subsequently, the methods are tested, given that the sensor errors are taken to be
normally distributed. In this paper, residual analysis is first utilized to check the
feasibility of the data. Then, the residual error is tested by the Chi Square test. The
residual error must pass this test for the data and information to be considered
accurate. Otherwise, the deviation is discarded. Sensors have different precisions.
Thus, the difference for each sensor should be normalized. The normalized difference of
sensor x is defined as follows:

$$\varepsilon^{*}_x = \frac{\varepsilon_x}{\sigma'} \tag{7}$$

where $\sigma'$ is the sample standard deviation. Normalization is applied to the two
sets, and the results are sorted and stored in the set $\varepsilon^{**}$.

The normalized differences and their order are analyzed. The possible sample values are
divided into discrete areas, which are used to calculate the expected number of
samples in each area. The Chi Square test is conducted to check the consistency between
the sample distribution and the expected distribution, that is:

$$\chi^{2} = \sum_{i=1}^{k} \frac{(n_{s,i} - n_{e,i})^{2}}{n_{e,i}} \tag{8}$$

where k is the number of discrete areas and $n_{e,i}$ is the distribution of the
expectation. The distribution of samples is expressed by $n_{s,i}$. Subsequently,
$\chi^{2}$ is compared to the hypothetical threshold. If the threshold is not exceeded,
the distribution of the samples is considered consistent with the expected
distribution. Otherwise, the residual error cannot be used further.
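
The consistency check can be sketched with the standard Pearson chi-square statistic. The function name, the bin counts, and the threshold are assumptions for illustration: 7.815 is the usual 5% critical value for 3 degrees of freedom, not a value given in the paper.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic over k discrete areas (bins)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of normalized residuals in four equally likely bins.
observed = [18, 22, 20, 20]
expected = [20, 20, 20, 20]
stat = chi_square_statistic(observed, expected)

THRESHOLD = 7.815                      # assumed: 5% level, 3 degrees of freedom
residuals_usable = stat <= THRESHOLD
```

When the statistic stays below the threshold, the residuals are treated as consistent with the expected distribution and the deviation estimate is kept.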


3 Surveillance of Deviations 


Apart from evaluating the deviations, the results from calculations can also help in
monitoring the state of sensors. In this study, a statistical method is employed to identify 
whether the deviations exist between the two sensors. Two aspects are considered in 
detecting deviations: the detection for a single sample and the detection for accumulated 
samples. Both are described in the next sections. 


3.1 Detection for a Single Sample 


The detection for a single sample transforms the reference trajectory and deviation 
calculation data, such as distance, orientation, and elevation, into the same coordinate 
systems. Unlike the deviation estimates, the result of deviation calculation constructs a 
test statistic that is used to identify the consistency of the deviation for the reference 
track in a statistically significant way. If the statistic is greater than the predefined 
threshold value, the probability of significant deviation between the two sensors is 
greater. Otherwise, the obvious deviation is not considered. Three steps are required to 
detect a single sample: calculation of deviation, calculation of test statistics, and calcu- 
lation of examination. Each step is described in the following sub-sections. 


3.1.1 Calculating the Deviation 

For parameter p, the method to obtain the deviation is similar to the method used in
calculating sensor deviations. The difference between the data tracked by a sensor and
the reference value is obtained by

$$\varepsilon^{p}_{s}(i) = p_s(i) - \hat{p}(i) \tag{9}$$

where $\varepsilon^{p}_{s}(i)$ indicates the difference of sensor s at point i, and
$p_s$ is the parameter (e.g., distance, orientation, and elevation). The reference
value of parameter p is expressed by $\hat{p}(i)$, which can be formulated as follows:

$$\hat{p}(i) = \sum_{j=0}^{n} c_j\,(t_i - t_{ref})^{j} \tag{10}$$

where $t_i$ is the measured time for point i and $c_j$ is the polynomial coefficient
calculated from the reference trajectory. The duration for constructing the reference
trajectory is expressed by $t_{ref}$.
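
The polynomial reference value and the resulting per-point difference can be sketched as below. This is an illustrative least-squares fit via the normal equations, not the authors' code; the function names, the polynomial degree, and the sample track are assumptions, and the reference time is taken as the midpoint of the tracked interval.

```python
def fit_polynomial(times, values, degree, t_ref):
    """Least-squares coefficients c_0..c_degree of sum_j c_j * (t - t_ref)**j."""
    xs = [t - t_ref for t in times]
    size = degree + 1
    # Normal equations A c = b: A[r][c] = sum x^(r+c), b[r] = sum y * x^r.
    a = [[sum(x ** (r + c) for x in xs) for c in range(size)] for r in range(size)]
    b = [sum(y * x ** r for x, y in zip(xs, values)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def reference_value(coeffs, t, t_ref):
    """Evaluate the fitted reference trajectory at time t."""
    return sum(c * (t - t_ref) ** j for j, c in enumerate(coeffs))


# Hypothetical track that follows p(t) = 1 + 2t + 3t^2 exactly.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
track = [1.0, 6.0, 17.0, 34.0, 57.0]
t_ref = (min(times) + max(times)) / 2
coeffs = fit_polynomial(times, track, degree=2, t_ref=t_ref)
difference = 90.0 - reference_value(coeffs, 5.0, t_ref)   # sensor reads 90 at t = 5
```

The fitted curve gives a reference value of 86 at t = 5, so a sensor reading of 90 yields a difference of about 4 for that point.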
3.1.2 Calculating the Test Statistics
The Student method is utilized to monitor the deviations between two sensors. We
define a variable $t_p$ as the result of monitoring in terms of parameter p, that is:

[Equation (11), which defines $t_p$, is not legible in the source; only a variance term
$\sigma_s^{2}$ survives.]

For parameter p, an obvious deviation is considered to exist if $t_p$ is larger than the
upper threshold. If $t_p$ is less than the lower threshold, the two sensors are
considered to be working cooperatively without deviation. Otherwise, the present study
is unable to determine whether deviation exists between the two sensors or not (i.e.,
in terms of parameter p). In this case, the detection for accumulated samples is used
to investigate the deviation between the two sensors.


3.2 Detection for Accumulated Samples 


The detected data, such as starts, stops, and average difference, are recorded. Then,
the results of the detection for each single sample are counted. For the two sensors,
the detection for accumulated samples can be defined as follows:

[Equation (12) is only partly legible in the source; the surviving term is
$(N_1 - 1)\sigma_{1}^{2} + (N_2 - 1)\sigma_{2}^{2}$.]

where t is the detection result of the accumulated samples. The standard deviation is
expressed by $\sigma_p$. The variable $N_p$ is the number of records for parameter p.
The total number of data points that indicate the existence of deviation is stored in
variable $k_1$. Similarly, the total number of data points without deviation is $k_2$.
If $k_1 < 2 \times k_2$, the accumulated result is considered ambiguous and the
accumulated samples will be examined in further tests.
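
Because the test-statistic equations are not legible in the source, the following sketch substitutes the standard pooled-variance Student statistic, which matches the surviving pooled-variance term, together with the three-way decision described for a single sample. The function names and the two threshold values are assumptions, not values from the paper.

```python
import math


def pooled_t(mean1, sigma1, n1, mean2, sigma2, n2):
    """Standard two-sample Student statistic with pooled variance (assumed form)."""
    pooled_var = ((n1 - 1) * sigma1 ** 2 + (n2 - 1) * sigma2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


def decide(t, lower, upper):
    """Three-way decision: deviation, no deviation, or undetermined."""
    t = abs(t)
    if t > upper:
        return "deviation"
    if t < lower:
        return "no deviation"
    return "undetermined"


# Hypothetical accumulated results of two sensors for one parameter.
t_value = pooled_t(1.0, 1.0, 10, 0.0, 1.0, 10)
verdict = decide(t_value, lower=1.0, upper=2.0)
```

Values that fall between the two thresholds are left undetermined, which is the case the accumulated-sample detection is meant to resolve.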

The multiple data calculation method is used to monitor the deviation between two
sensors. We combine the results of the tests to obtain the cumulative test statistic.
The multiple data calculation method is defined as follows:

$$N_t = n_1 + n_2 \tag{13}$$

$$a_t = \frac{w_1 a_1 + w_2 a_2}{w_1 + w_2} \tag{14}$$

$$\sigma_t = \frac{1}{\sqrt{w_1 + w_2}} \tag{15}$$

$$w_x = \frac{1}{\sigma_x^{2}} \tag{16}$$

where $a_x$ is the average value of sensor x, and the variance is expressed by
$\sigma_x^{2}$. $n_x$ is the number of points of sensor x.
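
The combination step can be sketched as an inverse-variance weighted average. The source equations are garbled, so the form used here (weights 1/sigma^2, combined standard deviation 1/sqrt(w_1 + w_2)) is the standard one and is an assumption about what the authors intended; the function name and sample values are hypothetical.

```python
import math


def combine(a1, sigma1, n1, a2, sigma2, n2):
    """Merge two results: total count, weighted average, combined standard deviation."""
    w1, w2 = 1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2
    average = (w1 * a1 + w2 * a2) / (w1 + w2)
    sigma = 1.0 / math.sqrt(w1 + w2)
    return n1 + n2, average, sigma


# Hypothetical averages of two result sets with equal variance.
total, average, sigma = combine(1.0, 1.0, 10, 3.0, 1.0, 20)
```

With equal variances the two averages are weighted equally, and the combined standard deviation is smaller than either input, reflecting the larger accumulated sample.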


4 Conclusion 


This paper discussed the importance of the track points during flight and presented 
several practical methods to analyze the accuracy of the data obtained by sensors. We 
first constructed the reference trajectory, and then transformed the tracked data into the 
same coordinates system. The deviation between two sensors was evaluated, and then 
the Chi Square method was used to investigate the correctness of the deviation evalua- 
tion. In addition, we proposed a statistical method to identify whether the deviation exists 
between sensors or not. Future work may consider more realistic situations, more algo- 
rithms to calculate deviation, and more evaluation methods to determine the accuracy 
of deviations. 
Abstract. Satellite networks support a wide range of services and provide global
coverage. Moreover, LEO satellites, which have global coverage potential, have
attracted great attention in the field of satellite communication. Because of the
satellites’ mobility, an efficient and stable routing algorithm is necessary for
transmission in satellite networks. In this paper, a Distributed Multi-Path Routing
algorithm (DMPR) using Earth-fixed satellite systems is proposed. DMPR adopts the polar
orbit constellation, in which each satellite has four ISLs (inter-satellite links) with
its neighbor satellites. A traffic light is set for every ISL to detect blocking, so
that busy paths are avoided and the delay is reduced.


Keywords: Low earth orbit - Polar orbit constellation 
Distributed multipath routing - Traffic light 


1 Introduction 


Satellite networks are becoming important in the field of communication, especially
LEO satellite networks. Compared to terrestrial networks, LEO satellites have global
coverage potential. Besides, they can also offer services with lower latency than
geostationary satellites. Therefore, LEO satellite communication has aroused great
interest in academia and industry.

Satellite topologies are dynamic, with frequent link switching and interruption.
Since the satellite constellation is determined in advance, the topologies are
predictable. These characteristics, which differ from those of terrestrial networks,
bring many difficulties in routing design. Many schemes have been proposed by
researchers since the 1990s [1]. A topology control strategy can generally be adopted
to shield the dynamics of the topology. There are two topology control strategies:
virtual topology [2, 3] and virtual node [4, 5]. The virtual node strategy uses the
concept of satellite logical locations to form a global virtual network. Each node in
the network is a virtual node, which is serviced by the nearest satellite.

Werner [6] introduced the DT-DVTR routing algorithm (Discrete-Time Dynamic
Virtual Topology Routing) for LEO satellite networks. The routing strategy is divided
into two steps: first, a virtual topology is set for all successive time intervals of
the system period, providing instantaneous sets of alternative paths between all
source-destination node pairs; second, path sequences over a series of time intervals
are chosen from these sets according to certain optimization procedures. Korcak [7]
proposed a multistate virtual network (MSVN) topology, provided a formal mathematical
model for it, and discussed its contribution to the overall system availability. Then
Lu et al. [8] formalized and optimized the virtual topology based on the virtual node
strategy while elaborating its important features.

There are many distributed routing algorithms [9, 10] based on virtual nodes. For example, a
datagram routing algorithm has been proposed for LEO satellite networks in which routing
decisions are made on a per-packet basis. Henderson and Katz [11] proposed a distributed
routing algorithm in which a link-state packet is flooded only within the routing radius of
a given satellite, and the best path is then chosen using Dijkstra's algorithm.

This paper is organized as follows. Section 2 introduces the satellite network architecture
with Earth-fixed footprints. Section 3 proposes the new routing strategy.
Section 4 presents the simulation results of the new routing algorithm, and
Sect. 5 concludes the paper.


2 Satellite Network Architecture 


The satellite network adopts the Walker star [12] constellation, which is composed of N
polar orbits (planes), each with M satellites at low altitude. The planes
are separated from each other by the same angular distance of 360°/(2 * N). They cross
each other only over the North and South poles. The satellites of each plane are separated
from each other by an angular distance of 360°/M. Because the orbits are circular, the
satellites in the same plane have the same orbital radius at all times. In this paper, N is 6 and
M is 12.
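As a quick check of the spacing formulas above, the following minimal sketch evaluates them for the values used in this paper (N = 6, M = 12). The function and key names are illustrative assumptions and do not come from the original work.

```python
def walker_star_geometry(num_planes: int, sats_per_plane: int) -> dict:
    """Angular spacings (in degrees) of a polar Walker star constellation."""
    return {
        # Adjacent planes are separated by 360 / (2 * N) degrees.
        "plane_separation_deg": 360.0 / (2 * num_planes),
        # Satellites in one plane are separated by 360 / M degrees.
        "in_plane_separation_deg": 360.0 / sats_per_plane,
        "total_satellites": num_planes * sats_per_plane,
    }


geometry = walker_star_geometry(num_planes=6, sats_per_plane=12)
print(geometry)
```

With N = 6 and M = 12, both spacings evaluate to 30° and the constellation contains 72 satellites.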

Each satellite has four neighboring satellites: two in the same orbit and one in each of the
left and right adjacent planes. Each satellite therefore has four transmission directions, that is,
four choices when deciding the next-hop node.

In our analysis of virtual topology dynamics, we consider a satellite orbit with $N_{sat}$
satellites that have Earth-fixed footprints and serve a total of $N_{fp}$ footprint areas along
the orbit. A footprint is the ground area served by satellites. First, the average
number of satellites per footprint area ($N_{sat}^{fp}$) is calculated as

$$N_{sat}^{fp} = N_{sat} / N_{fp} \tag{1}$$

In the traditional VN (virtual node) concept, $N_{sat}$ is equal to $N_{fp}$, i.e., there exists a single
satellite for each footprint, namely $N_{sat}^{fp} = 1$. However, to increase system
availability, this paper also considers the case where there is more than one satellite per
footprint. To improve the transmission success rate while simplifying the
handover process, we set $N_{sat}^{fp}$ to 2, as shown in Fig. 1.

Fig. 1. The satellite system and virtual network with $N_{sat}^{fp} = 2$


In the virtual network, we use <n, m> (n = 1, 2, ..., N; m = 1, 2, ..., M/2) to represent a
virtual footprint. Each satellite is also assigned the corresponding address as the basis
for transmission. Because satellites move, their coordinates
need to be updated periodically. Ignoring the impact of the Earth's rotation, we
only update the value of m. There are M/2 footprints in each orbit, hence each satellite
takes M/2 different m values. The satellite updating period ($T_{upd}$) can be calculated as:

$$T_{upd} = T_{sat} / (M/2) = 2 \cdot T_{sat} / M \tag{2}$$

where $T_{sat}$ is the satellite movement period.
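The updating rule can be illustrated with a short sketch. The 6000 s movement period is a hypothetical value chosen only for the example, and the wrap-around direction of m (increasing, then back to 1) is an assumption, since the text does not state it.

```python
def update_period(movement_period_s: float, sats_per_plane: int) -> float:
    """Updating period from Eq. (2): T_upd = T_sat / (M / 2) = 2 * T_sat / M."""
    return 2.0 * movement_period_s / sats_per_plane


def next_footprint_index(m: int, sats_per_plane: int) -> int:
    """Advance m by one footprint, wrapping within 1..M/2.

    Assumption: m increases at each update; the source does not give the direction.
    """
    footprints_per_orbit = sats_per_plane // 2
    return m % footprints_per_orbit + 1


# Hypothetical movement period of 6000 s with M = 12 satellites per plane.
t_upd = update_period(6000.0, 12)
print(t_upd)
```

With M = 12 there are six footprints per orbit, so under these assumptions a satellite changes its m value six times per movement period.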


3 Routing Strategy 


DMPR is a distributed routing algorithm: each
satellite node independently determines its next-hop node depending on ISL
congestion. On each satellite, a buffer queue is set for each ISL to temporarily store
packets to be sent to the next-hop satellite through this ISL. A traffic light is also set for
each buffer queue to represent the congestion condition of the ISL. To select the next-hop node (or the
transmission direction), the satellite considers the current congestion status
of its buffer queues.

Fig. 2. Traffic light color setting scheme (Color figure online)

Figure 2(a) shows how the color of the traffic light is set according to congestion. In our
algorithm, there are only two traffic light colors: “GREEN” and “RED”. First
of all, BQ(n) (n = 1, 2, 3, 4, since every satellite has 4 ISLs) is defined as
the buffer queue of the satellite in direction n. The buffer queue occupancy rate BOQR
is defined as the ratio of the number of packets in the queue to the length of the queue, so
$BOQR_n$ represents the occupancy rate of BQ(n). We then define a threshold T to
distinguish the two states. When $BOQR_n$ grows beyond T, the queue is considered to
be congested and the traffic light turns “RED”. There is a special situation: when
the satellite checks BQ(n), $BOQR_n$ is very close to but still below T, so the traffic
light remains “GREEN”. Just after the check, $BOQR_n$ exceeds T, but the traffic light
stays in the wrong state for $\Delta t$ (the checking interval). To ensure that $BOQR_n$ will not exceed
T during this checking interval, T should meet:

$$T \cdot L \cdot APS > (I - O)_{max} \cdot \Delta t \tag{3}$$

where L represents the size of the buffer queue, APS represents the average packet size,
and $(I - O)_{max}$ denotes the maximum difference between the input and output traffic rates.
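The traffic light rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, and the queue is measured in packets for simplicity.

```python
from dataclasses import dataclass

GREEN = "GREEN"
RED = "RED"


@dataclass
class IslQueue:
    """Buffer queue BQ(n) of one ISL (hypothetical structure)."""
    capacity: int    # length of the queue, in packets
    queued: int = 0  # packets currently waiting

    def occupancy_rate(self) -> float:
        """Occupancy rate: packets in the queue divided by the queue length."""
        return self.queued / self.capacity


def traffic_light(queue: IslQueue, threshold: float) -> str:
    """Return RED once the occupancy rate grows beyond the threshold T."""
    return RED if queue.occupancy_rate() > threshold else GREEN


# Hypothetical threshold T = 0.7 and a 100-packet queue.
print(traffic_light(IslQueue(capacity=100, queued=50), 0.7))
print(traffic_light(IslQueue(capacity=100, queued=80), 0.7))
```

Because the light is only refreshed once per checking interval, the value returned here may lag behind the true queue state, which is the situation the threshold condition above is meant to cover.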

When satellites move to high latitudes, the distance between them becomes shorter than
at low latitudes, as shown in Fig. 2(b). The distance between satellites in orbit L2 can
be described by:

$$d = \pi \times (R \times \sin\alpha + h) \div N$$

(R means the Earth radius, h means the satellite height, and N means the number of
orbits).

According to the Shannon formula, the channel capacity can be described by:

$$R_s = B \cdot \log\left(1 + \frac{S}{N}\right) = B \cdot \log\left(1 + \frac{P \cdot h_s}{\sigma^2}\right) \tag{4}$$

where $h_s$ denotes the channel gain, whose expected value decreases as a power of the
distance d. It can be seen that the channel capacity of satellites
at high latitudes is larger than at low latitudes, so the buffer size of the inter-orbit ISLs can be set
larger when satellites move to high latitudes. This means that as satellites serve
different footprints, their inter-orbit ISL buffer sizes change dynamically.

The routing algorithm begins when a terrestrial device sends data to a
satellite. Each terrestrial device can send data to 2 different satellites, because every
footprint is served by two satellites. When the current satellite determines the next-hop
node, it compares footprint coordinates to decide the direction.

Let $\langle n_d, m_d \rangle$ be the destination footprint coordinate and $\langle n_i, m_i \rangle$ the current
satellite's footprint coordinate. The current satellite chooses at most two directions in which to
transfer data. The vertical and horizontal movements are described by

$$d_v = \begin{cases} +1, & \text{upward} \\ 0, & \text{no movement} \\ -1, & \text{downward} \end{cases} \qquad d_h = \begin{cases} +1, & \text{right} \\ 0, & \text{no movement} \\ -1, & \text{left} \end{cases} \tag{5}$$


When receiving data, the current satellite first checks whether the
same data has already been received. This check prevents satellites from transferring the same
data twice, which saves resources and improves the system's performance. If the data is new, the satellite
compares its own footprint coordinate with the destination coordinate to choose the
direction. For example, if $m_i > m_d$, then $d_v = +1$, meaning the current satellite chooses upward
as a backup direction. If $n_i < n_d$, then $d_h = +1$, meaning the current satellite chooses right as
a backup direction (Table 1).


Table 1. Direction selection scheme 





After choosing the two backup directions, the satellite refines its choice by taking
congestion into account. If a backup ISL's traffic light is “GREEN”, the satellite
transfers data through this ISL. If the backup ISL's traffic light is “RED”, the satellite
discards this ISL.
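The two-stage selection described above (coordinate comparison, then traffic light filtering) might be sketched as follows. Only the two cases stated in the text (upward when the current m exceeds the destination m, right when the current n is below the destination n) come from the paper; the opposite cases are assumed to be symmetric, coordinate wrap-around is ignored, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Footprint = Tuple[int, int]  # (n, m): orbit index and position along the orbit


def candidate_directions(current: Footprint, destination: Footprint) -> List[str]:
    """Return at most two backup directions toward the destination footprint."""
    n_i, m_i = current
    n_d, m_d = destination
    # From the text: m_i > m_d gives d_v = +1 (upward); the reverse is assumed symmetric.
    d_v = (m_i > m_d) - (m_i < m_d)
    # From the text: n_i < n_d gives d_h = +1 (right); the reverse is assumed symmetric.
    d_h = (n_i < n_d) - (n_i > n_d)
    directions = []
    if d_v:
        directions.append("up" if d_v > 0 else "down")
    if d_h:
        directions.append("right" if d_h > 0 else "left")
    return directions


def usable_directions(candidates: List[str], lights: Dict[str, str]) -> List[str]:
    """Keep only the backup ISLs whose traffic light is GREEN."""
    return [d for d in candidates if lights.get(d) == "GREEN"]


backups = candidate_directions(current=(2, 5), destination=(4, 3))
print(backups)
print(usable_directions(backups, {"up": "RED", "right": "GREEN"}))
```

In this example the satellite keeps only the rightward ISL, because the upward ISL is marked “RED” and is therefore discarded.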


4 Simulation and Evaluation 


In this section, we mainly analyze the transmission delay and
success rate. In satellite communications, congestion and satellite failure are important
factors affecting data transmission. In this experiment, the average one-hop ISL propagation
delay is set to 14 ms. When the ISL is set to “GREEN”, the maximum queuing
delay is set to 10 ms. When the ISL is set to “RED”, the maximum queuing delay is set to
30 ms. In the simulation, the buffer size of the inter-orbit links is set dynamically: at lower
latitudes, the buffer size (BQ(n)) is 20% smaller than at higher latitudes, and so is the threshold T.

Fig. 3. Simulation of transmission delay and success rate

Figure 3(a) shows that the latency of both routing algorithms increases with the proportion
of blocked ISLs, and that DMPR's latency is clearly shorter than DT-DVTR's. With DT-DVTR
routing, when a link is blocked, the satellite waits, which increases the queuing delay.
For DMPR, the impact of blocking on the delay is small, while DT-DVTR's delay
increases rapidly as the proportion of blocked ISLs grows.

In the DMPR routing algorithm, each satellite node has at most two choices when
selecting the next node. Even if a node fails, an alternative node can be found for
transmission, which mitigates the adverse effects of node failures. Accordingly,
Fig. 3(b) shows that the success rate of DMPR is higher than that of DT-DVTR, especially when
the number of disabled satellites exceeds 5.


5 Conclusion 


In this paper, we have proposed a new distributed routing algorithm for LEO satellite
networks. When choosing the next-hop satellite node, the satellite takes congestion
into account, so our routing strategy reduces latency when congestion is encountered.
Meanwhile, the algorithm maintains a high success rate
when nodes fail.

Abstract. This paper takes the smart home as its research object. It analyzes the intelligent
acquisition of monitoring data, puts forward a remote
monitoring solution based on a wireless sensor network, and designs a complete
intelligent remote monitoring system using the mobile terminal platform. The
system provides a series of basic functions, including monitoring, automatic
adjustment, alarm, and control of home appliances. It supplies
comprehensive and reliable home environment information to users
and enables remote control of home appliances. Finally, the smart
home monitoring system was deployed and tested. The test results show that the
system achieves the basic functions of a modern smart home and has
advantages in performance, cost, versatility, scalability, and other
aspects.


Keywords: Smart home - Wireless sensor - ZigBee - HTML5 - WebSocket


1 Introduction 


With the rapid development of wireless sensor networks, smartphone communication,
and Internet technology, the remote wireless mobile home monitoring system will
become one of the mainstream smart home networks [1]. A smart home system needs
to acquire and transmit a large amount of information. Introducing wireless sensor
network technology into the construction of the smart home can build a “home appliance
network” with adaptive control functions [2]. This network not only
lets smart appliances coordinate with each other, but also interacts with the
external network. Therefore, users can remotely control the smart appliances
through this network.

At present, major operators at home and abroad are integrating resources and
launching innovative services to occupy the smart home market [3-5]. Although the
smart home industry is developing relatively quickly, the present level of development
is uneven. The industry faces many problems, such as the lack of unified industry standards,
high prices, complex operation, and information leakage, which restrict its
development [6, 7]. In this research, field research was
conducted to identify the deficiencies of current smart home systems, and the environmental
parameters required for the monitoring system were determined through the
analysis of actual functional requirements. Combining ZigBee wireless network
communication technology with research on intelligent sensor-based collection of monitoring
data, a complete intelligent remote monitoring system for the smart home
was designed on the mobile terminal platform. A debugging environment
was then deployed to analyze and verify the research results.


2 Overall Design 


The overall design is based on research into wireless sensor network technology,
wireless data communication technology, HTML5, WebSocket, and other
related theoretical foundations [8]. Combined with corporate field research and an analysis of
actual functional requirements, the overall design of the smart home monitoring system
based on the wireless sensor network was determined, as shown in Fig. 1.
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Fig. 1. Overall design 


The system collects the required information about the home environment through the relevant
sensors and switches household electrical appliances through the controller
modules. In this system, the sensor modules and the controller modules are connected
to the ZigBee terminal equipment. Data transmission is based on the ZigBee wireless
network and the ZigBee central coordinator. The ZigBee central coordinator is connected
to the server through the serial ports. It can save the data collected from the sensors into
the server database or send the data to the client directly. According to the implementation
process of the system, the overall scheme includes the design of the front-end sensor and
controller modules, the ZigBee wireless sensor network, the data transmission
module, the server program, the data interaction module, and the
client application software.


Optimization of Smart Home System 267 


3 Wireless Sensor Network Design 


3.1 ZigBee Wireless Network Design 


The ZigBee wireless transmission module of this system uses the TI CC2530F256 wireless RF chip
as its core. The CC2530F256 is compatible with the ZigBee protocol
stack, providing a powerful and complete ZigBee solution. Because the transmission
distance is relatively short and the floor area of a home is limited, the system needs only
a relatively small number of sensor nodes and a simple data transmission structure, and does not
need router nodes to extend the network coverage area [9]. Meanwhile, the position
and number of household appliances change easily, which changes the
network state. Therefore, provided that the communication requirements are met, and considering
the equipment cost and structural simplicity, this system uses a star
topology to build the smart home wireless network. The star-topology wireless network
consists of a ZigBee full-function central coordinator and several terminal
devices, and data transmission follows the principle of ZigBee wireless data
transmission.

In this paper, the DS18B20 temperature sensor module is taken as an example
to introduce the concrete design of the data acquisition and transmission software.
First, the terminal device collects the data. Then the
data is transmitted to the central coordinator. Finally, the coordinator sends the data to
the server module through the serial ports. To make the results of data acquisition
easy to observe, the coordinator displays the collected values on an LCD. Meanwhile,
the data in the terminal nodes and the coordinator node is sent to the computer
through the serial ports and shown by a serial debugging assistant. To simplify
system layout and installation while reducing energy consumption,
the data acquisition nodes are all battery powered [5, 10]. The node modules
enter sleep mode when idle and are woken up when the system needs data acquisition
or transmission. The entire workflow is shown in Fig. 2.
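As an illustration of the data acquisition step, the sketch below converts the two temperature bytes read from a DS18B20 into degrees Celsius. It assumes the sensor's default 12-bit resolution, in which the 16-bit two's-complement reading has a step of 1/16 °C; the function name is hypothetical, and the paper's node firmware is not shown in the source.

```python
def ds18b20_raw_to_celsius(lsb: int, msb: int) -> float:
    """Convert the DS18B20 temperature register bytes to degrees Celsius.

    Assumes the default 12-bit resolution (1/16 degree Celsius per bit).
    """
    raw = (msb << 8) | lsb
    if raw & 0x8000:      # negative temperatures are stored in two's complement
        raw -= 1 << 16
    return raw / 16.0


print(ds18b20_raw_to_celsius(0x91, 0x01))
print(ds18b20_raw_to_celsius(0x5E, 0xFF))
```

The two calls correspond to raw readings of 0x0191 and 0xFF5E, which represent a positive and a negative temperature respectively.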
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Fig. 2. Temperature sensor node workflow 
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3.2 Development and Encryption of the Data Communication Protocol 


The data communication protocol was designed to solve the problems of
parameter transmission, ensure the integrity, reliability, and security of the data, and
support data transmission between the modules. Based on the basic Modbus
communication protocol, the system uses the same frame structure to define different
command codes. According to the concrete data and the specific functions, the system
customizes a complete communication protocol, which makes the system more
standardized. To make the protocol
scalable, function code fields are set in the message unit,
and a function code table is established to manage the function codes defined by
the system. Developers only need to add a function code to the table when
they want to add a new function to the protocol, which does not affect the existing
communication protocol and functions.
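A function code table of this kind can be sketched as below. The frame layout (address, function code, payload, CRC-16 with the low byte first) follows standard Modbus RTU framing, which is an assumption about the paper's protocol; the function names and the two code values are hypothetical and were simply picked from the range that Modbus reserves for user-defined codes.

```python
import struct

# Hypothetical function code table; new functions are added here only.
FUNCTION_CODES = {
    "READ_TEMPERATURE": 0x41,
    "SWITCH_APPLIANCE": 0x42,
}


def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def build_frame(address: int, function: str, payload: bytes = b"") -> bytes:
    """Assemble address + function code + payload + CRC (low byte first)."""
    body = bytes([address, FUNCTION_CODES[function]]) + payload
    return body + struct.pack("<H", crc16_modbus(body))


def parse_frame(frame: bytes):
    """Check the CRC and split a frame into (address, function code, payload)."""
    if len(frame) < 4:
        raise ValueError("frame too short")
    body, (received_crc,) = frame[:-2], struct.unpack("<H", frame[-2:])
    if crc16_modbus(body) != received_crc:
        raise ValueError("CRC mismatch")
    return body[0], body[1], body[2:]


frame = build_frame(0x01, "SWITCH_APPLIANCE", b"\x01")
print(frame.hex(), parse_frame(frame))
```

Adding a new command only requires a new entry in `FUNCTION_CODES`; `build_frame` and `parse_frame` stay unchanged, which mirrors the scalability argument in the text.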

To avoid interference from external equipment and ensure the confidentiality
and security of data during network communication, the
secure, encrypted transmission of data in the smart home network is a
key concern of this research. ZigBee network data is protected by three basic encryption keys:
the master key, the link key, and the network key. According to the actual
application, the keys are generated by the network layer and the application layer. The
security keys can be shared between the layers, which reduces the storage requirement.
The system uses the CC2530 chip as the ZigBee wireless transceiver module,
and data encryption and decryption on the CC2530 are performed by its
AES (Advanced Encryption Standard) coprocessor.


4 Design of the Monitoring System 


4.1 Design of the Functional Modules 


According to the demand analysis, the smart home monitoring system
is divided into five main functional modules: real-time monitoring,
security alarm, scene mode, basic information management, and user management.
Together, these modules provide real-time display of monitoring data,
alarms for abnormal conditions, linkage control, and historical data display, and also
support statistical processing, analysis, and diagnosis of the massive monitoring data
to intelligently search for potentially useful information. In addition, the system
provides mode selection, information management, and scalability. The overall functional
structure of the system is shown in Fig. 3.
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Fig. 3. Overall system functional structure 








4.2 Acquisition and Processing of the Server Data 


The server uses multithreaded serial communication to obtain the sensor-layer
data. During data communication, the various serial operations are placed
in different threads, and each thread is responsible only for its corresponding serial
operation. The main thread coordinates and manages the
auxiliary threads, and the CPU switches quickly between the threads.
The system thus achieves concurrent execution and multitasking,
which reduces the CPU occupancy rate. This mechanism improves the utilization
of the serial ports and the efficiency of data transmission, and thereby greatly improves the
data request and processing ability of the server.
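The thread layout described above can be sketched with one worker thread per serial port and a main thread that coordinates them. This is a minimal Python stand-in rather than the authors' server code: real serial reads are replaced by in-memory lists, and the port names and sample values are hypothetical.

```python
import queue
import threading


def port_worker(port_name, readings, out_queue):
    """Auxiliary thread: handles exactly one (simulated) serial port."""
    for value in readings:
        out_queue.put((port_name, value))
    out_queue.put((port_name, None))  # sentinel: this port has no more data


def collect(sources):
    """Main thread: start one worker per port and gather their results."""
    out_queue = queue.Queue()
    workers = [
        threading.Thread(target=port_worker, args=(name, data, out_queue))
        for name, data in sources.items()
    ]
    for worker in workers:
        worker.start()
    collected = {name: [] for name in sources}
    remaining = len(workers)
    while remaining:
        name, value = out_queue.get()
        if value is None:
            remaining -= 1
        else:
            collected[name].append(value)
    for worker in workers:
        worker.join()
    return collected


# Hypothetical ports and sample values standing in for real serial reads.
result = collect({"COM1": [21.5, 21.6], "COM2": [48, 47]})
print(result)
```

The queue decouples the port threads from the main thread, so a slow port does not block the others; the order of values is preserved per port but not across ports.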

To address common database problems such as frequent operations and
low query efficiency, the system uses a cache mechanism to save the collected data.
The cache mechanism keeps the data in cache memory. Returning data from
cache memory is quicker than querying the database, so it greatly improves the performance
of the server application. During operation, the database enables data
cache dependency. When the server requests the latest data collected by the sensor, it
writes the data to the cache and updates the cache. Setting the key index
“CacheKey” when the system writes to the cache improves the efficiency of querying
the data in the cache. The concrete process of data storage using the data cache mechanism
is shown in Fig. 4.
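The caching idea can be illustrated with a small write-through cache keyed by a cache key. This is a sketch under assumptions: a plain dictionary stands in for the server database, the class and method names are hypothetical, and the database-side cache dependency mentioned in the text is not modeled.

```python
class ReadingCache:
    """Write-through cache in front of a slower store (a dict stands in for the database)."""

    def __init__(self, database: dict):
        self._database = database
        self._cache = {}
        self.database_reads = 0  # counts how often the slow path is taken

    def store(self, cache_key: str, reading: float) -> None:
        """Save a new reading to the database and refresh the cache entry."""
        self._database[cache_key] = reading
        self._cache[cache_key] = reading

    def latest(self, cache_key: str):
        """Return the cached reading, falling back to the database on a miss."""
        if cache_key in self._cache:
            return self._cache[cache_key]
        self.database_reads += 1
        reading = self._database.get(cache_key)
        if reading is not None:
            self._cache[cache_key] = reading
        return reading


# Hypothetical key naming: one cache key per sensor.
cache = ReadingCache(database={"livingroom.temperature": 22.5})
cache.latest("livingroom.temperature")   # miss: read from the database once
cache.latest("livingroom.temperature")   # hit: served from the cache
cache.store("livingroom.temperature", 23.0)
print(cache.latest("livingroom.temperature"), cache.database_reads)
```

Repeated queries for the same key reach the database only once, which is the effect the cache key index is intended to achieve.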
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Fig. 4. Flow chart of cache data storage 


4.3 Information Interaction of the Client and Server 


For basic operations that are neither frequent nor subject to real-time
requirements, such as user login, obtaining equipment information, and querying
historical data, the client uses Ajax asynchronous loading
to invoke the Web Service on the server and obtain the basic information. So that
the client and server can exchange data across platforms and
languages, the system uses the lightweight data exchange format JSON for the data
exchange between them.
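A JSON exchange of this kind might look as follows. The field names are hypothetical, since the paper does not list the message schema; the sketch only shows that a reading serialized on one side can be restored unchanged on the other.

```python
import json


def encode_reading(sensor_id: str, kind: str, value: float, unit: str) -> str:
    """Serialize one sensor reading to a JSON string (hypothetical field names)."""
    message = {"sensorId": sensor_id, "type": kind, "value": value, "unit": unit}
    return json.dumps(message, sort_keys=True)


def decode_reading(text: str) -> dict:
    """Parse a JSON string back into a dictionary on the receiving side."""
    return json.loads(text)


payload = encode_reading("node-01", "temperature", 22.5, "C")
print(payload)
print(decode_reading(payload)["value"])
```

Because JSON is plain text, the same payload can be produced by the server and consumed by a browser client without either side depending on the other's language or platform.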

Viewing and operating on data in the real-time monitoring system requires the client
and server to keep a persistent connection and constantly obtain updated data
from the sensor layer, which places high demands on real-time communication. The
system therefore adopts WebSocket full-duplex bidirectional communication to
achieve real-time monitoring. First, the system starts a thread on the server to open
the WebSocket network service and uses delegate methods to open or close the
WebSocket connection. Then it obtains the IP address of the machine and sets up
the listening port. Finally, the system uses these two values as the parameters
of the WebSocket server address to transmit data. On the client side,
the system uses an HTML user interface. Once the WebSocket connection between the
client and the server is established, the two sides can carry on continuous two-way
communication with a high utilization ratio.


5 Application Test and Verification of the System 


5.1 Communication Distance and Packet Loss Rate Test 


To compare the performance of the ZigBee communication system in different
environments, the test was divided into an indoor test and an outdoor test. A
number of tests were conducted in different rooms and in open outdoor areas to ensure the
accuracy of the results. In each distance test, the ZigBee terminal node
transmitted data once every 3 s, and a total of 500 data packets were sent to the
coordinator. The average values of the test results are shown in Table 1.


Table 1. Communication distance and data packet loss rate of the ZigBee module

Test environment | Communication distance (m) | Number of terminal data transmissions | Number of coordinator data receptions | Number of tests | Average packet loss rate (%)

Indoor 10 0

Outdoor 0


As the test results show, with the system based on the ZigBee wireless
sensor network, when communication was free of interference in the outdoor environment,
the packet loss rate was very low within a 50 m range. When the node distance
exceeded 70 m, the packet loss rate increased slightly. In the indoor environment,
when the transmission distance was less than 25 m, the packet loss rate was about
0%. The results show that data transmission in the designed ZigBee network is
highly reliable and can meet the requirements of data transmission in a typical
family environment.
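The packet loss rate reported in this test is presumably the share of transmitted packets that the coordinator did not receive. A minimal sketch of that calculation follows; the formula is the conventional definition rather than one quoted from the paper, and the received count of 495 is a made-up example, not a measured value.

```python
def packet_loss_rate(sent: int, received: int) -> float:
    """Percentage of packets sent by the terminal node that never reached the coordinator."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


# 500 packets per test as in the text; 495 received is a hypothetical count.
print(packet_loss_rate(500, 500))
print(packet_loss_rate(500, 495))
```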


5.2 WebSocket Real-Time Communication Verification 


To examine the communication efficiency and real-time performance of the WebSocket
technology used in the smart home monitoring system, two setups were built: Ajax polling based
on the HTTP protocol and a real-time communication framework based on WebSocket.
In this test, the data request and response information of the two schemes
during communication was analyzed and compared using the Chrome
browser's network tools. A test case was then designed with the JMeter load
testing tool, which simulated communication between the client and the server to
compare the network throughput and delay of the two methods.


5.2.1 Network Throughput 

In the two test schemes, the same test data was used to ensure that the server sent
the client the same actual amount of data. The JMeter test tool was used to measure the data size
transferred through Ajax polling and WebSocket, so that the changes in
network throughput could be observed in four cases. The results are shown in Fig. 5. As the
figure shows, as the load and the flow increased, the network traffic caused by
Ajax polling became very large. The WebSocket scheme caused much smaller network


272 X. Guo and N. Hu 


throughput than the traditional Ajax polling scheme, which is a significant performance
advantage.
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Fig. 5. Network throughput comparison of Ajax polling and WebSocket 


5.2.2 Network Delay 

After the test was completed, we used the “View Results Tree” listener of the JMeter test
tool to collect the connection time and loading time consumed during communication
and to calculate the network delay of the two communication schemes. To
ensure the accuracy of the results, several experiments were carried out for
each test case, and the measured delays were averaged.
The network delay test results are shown in Table 2.


Table 2. Average delay time of the test results

Number of users | Real-time application scheme | Number of messages transmitted | Total delay time (ms) | Average delay time (ms)
1 | | | | 8.02
 | | | | 4.83
100 | | | 29757 | 29.75
 | | | | 8.36
500 | | | 187074 | 187.07
 | | | | 12.63
1000 | | | | 472.28
 | | | | 16.48


A comparison of the two real-time communication
schemes and an analysis of the results show that the real-time application based on the WebSocket
communication scheme performs better than the currently widely used Ajax long polling







technology in terms of message delay time and network throughput. WebSocket
technology greatly reduces the network traffic and takes advantage of its stable
long-lived connection to reduce the communication delay, which makes it highly
efficient for the real-time monitoring system of the smart home.


6 Conclusion 


This paper introduced wireless sensors into the system, which greatly simplified the
system structure and improved flexibility. The serial debugging assistant was used to
test the communication distance and the data packet loss rate of the wireless network
set up with ZigBee equipment, and the ZigBee communication module test and
WebSocket communication test were completed in the smart home system, verifying the
stability and safety of the ZigBee wireless network. Then, real-time communication frameworks
based on Ajax polling and on WebSocket were constructed, and the network
throughput and network delay of the two schemes were tested with the Chrome
browser and the JMeter load testing tool, which demonstrated the advantages of WebSocket
technology for real-time system monitoring. Finally, an online overall test of the entire
smart home system was conducted on different mobile terminals. The system
worked well and met the design requirements, achieving the basic expected
functions of the smart home system.

Abstract. The opportunistic network is a novel form of networking that takes advantage of
encounter opportunities between mobile nodes to transmit a message from
the source node to the destination node hop by hop. A mobility
model decides how the nodes move and is used to analyze network performance
with various protocols, such as routing protocols and data dissemination protocols.
Many mobility models have been proposed by researchers. To evaluate
these mobility models, we propose a method
that assesses node mobility models by analyzing mobile distance. This paper first
introduces the commonly used mobility models based on the ONE simulation platform,
and then puts forward a method for calculating node mobile distance. Next,
the mobility features of nodes are simulated and discussed in terms of mobile distance.
Finally, we use the node contact duration and node
inter-contact time as validation to evaluate the node mobility models.


Keywords: Energy harvesting - Green communications -
Heterogeneous cellular networks - Traffic offloading


1 Introduction 


In recent years, the energy consumption of communication networks has gradually
become a research focus in industry and academia [1, 2]. How to reduce the energy
consumption of communication networks and improve the energy efficiency of communication
systems has become a hot topic in current communication research.
Heterogeneous networks are expected to play an important role in future communication
networks [3, 4]. A heterogeneous network introduces a number of different
types of small-site equipment, such as micro base stations (Micro), pico base stations
(Pico), home base stations (Femto), and Wi-Fi access points, which strengthen
coverage in certain areas and improve network capacity. These small sites are often deployed
closer to the users and can provide service with far less
power consumption than a macro base station (Macro). Despite the high network capacity,
dense small base stations (SBSs) also
require a huge power supply, placing heavy burdens on both the network operators and
the power grid [5].

To deal with this heavy energy consumption, energy harvesting (EH) technology
can be introduced into heterogeneous cellular networks (HCNs). Specifically, the emerging
EH-SBSs, which are equipped
with EH devices (like solar panels or wind turbines) and exploit renewable energy as

© Springer International Publishing AG 2018 
Q. Zu and B. Hu (Eds.): HCC 2017, LNCS 10745, pp. 274-287, 2018. 
https://doi.org/10.1007/978-3-319-74521-3_31


The Energy Efficiency Research of Traffic Offloading Mechanism 275


supplementary or alternative power sources, have received great attention from both
academia and industry [6]. How to fully utilize the harvested energy to maximize network
energy efficiency while satisfying the quality of service (QoS) requirements is a critical issue.

Traffic offloading is one of the hot issues in heterogeneous networks. The works in [7, 8]
analyze the relationship between traffic offloading and customer service quality and
put forward a traffic offloading incentive mechanism based on a reverse auction model, which
guarantees the quality of customer service and increases the amount of data offloaded by
the system. The work in [9] proposes offloading users with high service quality requirements
to femtocells to reduce the energy consumption of the macro cell, and analyzes
the relationship between the system energy efficiency and the distance between
neighbors. The work in [10] analyzes how the system power consumption is influenced by
different traffic densities and different numbers of micro cells. However, [9, 10] consider only
the static traffic load scenario, without considering dynamic changes in traffic. The work in [11]
considers offloading in the Wi-Fi-enabled small cell case.

Although traffic offloading has been extensively investigated in on-grid cellular
networks, the conventional offloading methods cannot be applied directly when EH is
used. Instead, traffic offloading for green heterogeneous cellular networks needs to be
devised so that the operation of each cell is optimized individually based on its
renewable energy supply. Different from existing works, we focus on the design of traffic
offloading for green HCNs with multiple SBSs powered by diverse energy sources. We
aim to maximize the network energy efficiency while satisfying the QoS requirement of
every user. To this end, users are dynamically offloaded from the macro base station
(MBS) to the SBSs based on statistical information about the traffic intensity and the
energy arrival rate, from which the optimal traffic offloading amount is obtained.

The remainder of the paper is organized as follows. The system model is introduced in
Sect. 2. Section 3 analyzes the power demand and supply for SBSs with EH devices.
Then, the energy efficiency maximization problem for the single-SBS case is studied in
Sect. 4. Simulation results are presented in Sect. 5, followed by the conclusion in Sect. 6.


2 System Model 


In this section, the details of the HCN with different energy supply are presented as 
follows. 


2.1 Network Model 


With EH technology employed, a typical HCN scenario is shown in Fig. 1, where
different types of SBSs and wifi access points are deployed in addition to the
conventional MBS to enhance network capacity. Based on the energy source, SBSs can be
classified into two types: (1) HSBSs, powered jointly by energy harvesting devices and
the power grid; and (2) RSBSs, powered solely by harvested renewable energy (like solar
and wind power). Wifi access points are powered only by conventional energy. Denote
by $N_H$ and $N_R$ the number of HSBSs and RSBSs, respectively, by
$\mathcal{B}_H = \{1, 2, \ldots, N_H\}$ and $\mathcal{B}_R = \{1, 2, \ldots, N_R\}$ the sets of HSBSs and
RSBSs, respectively, and let $\mathcal{B} = \{\mathcal{B}_H, \mathcal{B}_R\}$ be the set of all SBSs. Denote by
$W_n$ the number of wifi access points and by $\mathcal{B}_W = \{W_1, W_2, \ldots, W_n\}$ the set of
wifi access points. SBSn serves a circular area with radius $D_{s,n}$, and the small cells
are assumed to have no overlaps with each other. The MBS is always active to guarantee
the basic coverage, whereas SBSs and wifi access points can be dynamically activated for
traffic offloading or deactivated for energy saving, depending on the traffic and energy
status.








Fig. 1. Illustration of a HCN with diverse energy sources 


2.2 Traffic Model 


The user distribution in the spatial domain is modeled as a nonhomogeneous Poisson Point
Process (PPP), whose density at time t is $\rho_n(t)$ in small cell n and $\rho_m(t)$ outside of
all small cells. As shown in Fig. 1, users located outside of the small cells can only be
served by the MBS, while users within small cells can be partly or fully offloaded to the
corresponding SBS or the wifi access points. Thus, users can be classified into four types
based on their location and the serving BS or wifi access point: (1) Macro-Macro Users
(MMUs), users located outside of small cells and served by the MBS; (2) SBS-SBS Users
(SSUs), users located within small cells and offloaded to the SBS; (3) Macro-SBS Users
(MSUs), users located in small cells but served by the MBS; and (4) WIFI-WIFI Users
(WWUs), users located within small cells and offloaded to the wifi access points.

As for the spectrum resource, the bandwidths available to the MBS-tier and SBS-tier
are orthogonal to avoid cross-tier interference. Denote by $W_m$, $W_s$ and $W_w$ the system
bandwidth available to the MBS, each SBS and the wifi access points, respectively. At
SBSn, the bandwidth actually utilized is denoted as $w_{ss,n} \le W_s$, which is allocated
to its SSUs equally for fairness. At the MBS, $W_m$ is further divided into different
orthogonal portions: $w_{mm}$ for serving MMUs and $w_{ms,n}$ for serving MSUs in small
cell n, where $w_{mm} + \sum_{n=1}^{N_H+N_R} w_{ms,n} \le W_m$. At the wifi access points, the
bandwidth actually utilized is denoted as $w_{ww,n} \le W_w$.


2.3 Wireless Communication Model


If user u is served by SBSn, its received SINR is given by [12]

$$\gamma_{ss,nu} = \frac{P_{Ts}\, w_u\, d_{nu}^{-\alpha_s} h_{nu}}{W_s (\theta_s + 1)\sigma^2 w_u} \tag{1}$$

where $P_{Ts}$ is the transmit power level of the SBS, $w_u$ is the bandwidth allocated to
user u, $d_{nu}$ is the distance between user u and SBSn, $\alpha_s$ is the path loss exponent
of the SBS-tier, $h_{nu}$ is an exponential random variable with unit mean reflecting the
effect of Rayleigh fading, $\theta_s$ is the ratio of inter-cell interference to noise among
SBSs, and $\sigma^2$ is the noise power density. As each SBS allocates bandwidth equally to
its associated users, the achievable data rate of a generic user u is as follows:

$$r_{ss,nu} = w_{ss,nu} \log_2(1 + \gamma_{ss,nu}) \tag{2}$$

where $w_{ss,nu}$ denotes the bandwidth allocated to each user from SBSn.
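
As a rough numerical illustration of Eqs. (1) and (2), the sketch below evaluates the
SINR and the achievable rate of a single user. The function names, the argument names
and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def received_sinr(p_t, w_u, w_total, dist, alpha, fading, theta, sigma2):
    """SINR in the form of Eq. (1).

    p_t: transmit power, w_u: user bandwidth, w_total: system bandwidth,
    dist: user-to-BS distance, alpha: path loss exponent,
    fading: Rayleigh fading gain, theta: interference-to-noise ratio,
    sigma2: noise power density.
    """
    signal = p_t * (w_u / w_total) * dist ** (-alpha) * fading
    return signal / ((theta + 1.0) * sigma2 * w_u)


def achievable_rate(w_u, sinr):
    """Achievable rate in the form of Eq. (2): w * log2(1 + SINR)."""
    return w_u * math.log2(1.0 + sinr)


# Hypothetical, unit-free sample values chosen only to make the arithmetic easy.
sinr = received_sinr(p_t=12.0, w_u=2.0, w_total=4.0, dist=1.0,
                     alpha=3.0, fading=1.0, theta=0.0, sigma2=1.0)
rate = achievable_rate(2.0, sinr)
```

Because the same functional form is used for every tier, the same two helpers can be
reused with the parameters of another tier.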


Similarly, if user u is served by the MBS as an MMU or MSU, its received SINR is
given by

$$\gamma_{m,u} = \frac{P_{Tm}\, w_u\, d_{mu}^{-\alpha_m} h_{mu}}{W_m (\theta_m + 1)\sigma^2 w_u} \tag{3}$$

where $P_{Tm}$ is the transmit power level of the MBS, $d_{mu}$ is the distance between user u
and the MBS, $\alpha_m$ is the path loss exponent of the MBS-tier, $\theta_m$ is the interference
to noise ratio from other MBSs, and $h_{mu}$ reflects Rayleigh fading with the same
probability distribution as $h_{nu}$.
Then, the achievable data rate of user u is given by

$$r_{mm,u} = w_{mm,u} \log_2(1 + \gamma_{m,u}) \quad \text{for MMU} \tag{4}$$

$$r_{ms,nu} = w_{ms,nu} \log_2(1 + \gamma_{m,u}) \quad \text{for MSU} \tag{5}$$

where $w_{mm,u}$ and $w_{ms,nu}$ denote the bandwidth allocated to each user from the MBS.

If user u is served by a wifi access point, its received SINR is given by

$$\gamma_{w,nu} = \frac{P_{Tw}\, w_u\, d_{nu}^{-\alpha_w} h_{nu}}{W_w (\theta_w + 1)\sigma^2 w_u} \tag{6}$$

As each wifi access point allocates bandwidth equally to its associated users, the
achievable data rate of a generic user u is as follows:

$$r_{w,nu} = w_{w,nu} \log_2(1 + \gamma_{w,nu}) \tag{7}$$


3 Analysis of Power Supply and Demand


3.1 Green Power Consumption 


BSs can work in either active mode or sleep mode, with different power consumption
parameters. According to [13], the power consumption of a BS in active mode can be
modeled as:

$$P_{BS} = P_0 + \frac{w}{W}\beta P_T \tag{8}$$

where $P_0$ denotes the static power consumption, including the baseband processor, the
cooling system, etc. (the static power consumption of wifi access points is assumed to
be 0), the coefficient $\beta$ is the inverse of the power amplifier efficiency factor, W is
the available system bandwidth and w is the bandwidth of the utilized subcarriers. $P_T$
is the transmit power level and is treated as a system parameter, while we control the
RF power by adjusting the utilized bandwidth w.
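
A minimal sketch of the power model in Eq. (8) is given below. The function name and the
numerical values are hypothetical and serve only to show how the utilized bandwidth w
scales the load-dependent part of the power.

```python
def bs_power(p0, beta, p_t, w, w_total):
    """Active-mode BS power following Eq. (8): P_BS = P0 + (w / W) * beta * P_T."""
    if not 0.0 <= w <= w_total:
        raise ValueError("utilized bandwidth must lie in [0, W]")
    return p0 + (w / w_total) * beta * p_t


# Hypothetical parameters: static power 56 W, beta 2.6, transmit power 6.3 W,
# and half of a 20 MHz band in use.
half_load_power = bs_power(p0=56.0, beta=2.6, p_t=6.3, w=10e6, w_total=20e6)
```

Setting w to 0 leaves only the static term, which is why reducing the utilized bandwidth
is the control knob for the RF power in this model.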


3.2 Green Energy Supply Model 


In this work, we optimize the amount of offloaded traffic, the ON-OFF state, and the RF
power of each SBS at the large time scale, based on the stochastic information of traffic and
energy (i.e., user density and energy arrival rate). The optimization is conducted for each
period independently [13], and thus we can focus on the optimization for one period.

Two time scales are considered for the problem analysis. In the large time scale, we
divide the time into T periods (e.g., T = 24 and the length of each time period is 1 h),
and assume that the average energy harvesting amount and user density remain static in
each time period, but may change over different periods. During period t, the arrival of
renewable energy packets is modeled as a Poisson process with rate $\lambda_{E,n}(t)$ for SBSn,
and the distribution of users in small cell n follows a PPP with density $\rho_n(t)$.
A discrete energy model is adopted to describe the process of energy harvesting, and
a unit of energy is denoted by E [14]. Denote by $\lambda_{E,n}(t)$ the arrival rate of per unit
energy at SBSn and time t. The harvested energy is saved in the battery for future use.
The battery is considered to have sufficient capacity for realistic operation conditions,
and thus we assume no battery overflow happens. RSBSs, which have no grid power input,
have to be shut down when the battery runs out. Consequently, the corresponding users
will be served by the upper-tier MBS for QoS guarantee. Note that a handover procedure
is conducted when the RSBS is shut down or reactivated, causing additional power
consumption. HSBSs can use on-grid power when there is no green energy, until
renewable energy arrives.

The energy supply and consumption process of each SBS can be modeled by a queue,
where the queue length denotes the battery amount [15]. Based on the power consumption
model of BSs (Eq. (8)), the equivalent service rate of per unit energy for SBSn is
given by

$$\mu_{E,n}(t) = \frac{1}{E}\left(P_{Cs,n} + \frac{w_{ss,n}(t)}{W_s}\beta_{s,n}P_{Ts,n}\right) \tag{9}$$

where $P_{Ts,n}$, $P_{Cs,n}$ and $\beta_{s,n}$ are the transmit power, constant power and power
amplifier coefficient of SBSn, respectively. $\mu_E$ is called the energy consumption rate
in the rest of this paper for simplicity.

For an SBS with EH, the variation of the battery can be modeled as an M/D/1 queue with
arrival rate $\lambda_E$ and service rate $\mu_E$ given by Eq. (9). In what follows, we analyze
the stable status of the energy queue. For the M/D/1 queue, the embedded Markov chain
method is usually applied to analyze the stable status [16]. When $\lambda_E/\mu_E \ge 1$, the
queue is not stable and the queue length goes to infinity, which means that the harvested
energy is always sufficient. When $\lambda_E/\mu_E < 1$, the stationary probability distribution
of the queue length $L^*$ can be derived by the Pollaczek-Khinchin formula:

$$q_0 = 1 - \frac{\lambda_E}{\mu_E} \tag{10}$$

$$q_1 = \left(1 - \frac{\lambda_E}{\mu_E}\right)\left(e^{\lambda_E/\mu_E} - 1\right) \tag{11}$$

$$q_{L^*} = \left(1 - \frac{\lambda_E}{\mu_E}\right)\left\{ e^{L^*\frac{\lambda_E}{\mu_E}} + \sum_{k=1}^{L^*-1} e^{k\frac{\lambda_E}{\mu_E}} (-1)^{L^*-k}\left[\frac{\left(k\frac{\lambda_E}{\mu_E}\right)^{L^*-k}}{(L^*-k)!} + \frac{\left(k\frac{\lambda_E}{\mu_E}\right)^{L^*-k-1}}{(L^*-k-1)!}\right]\right\} \quad (L^* \ge 2) \tag{12}$$

Thus, the stationary probability distribution of the energy queue length (i.e., the
amount of available green energy) is derived.
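
The stationary distribution of an M/D/1 queue can be evaluated numerically with a few
lines of code. The sketch below uses the standard closed-form M/D/1 result with load
lam / mu; the function name and the sample rates are assumptions made for illustration.
The alternating sum loses numerical precision for long queues, so the sketch is meant
for small queue lengths only.

```python
import math


def md1_stationary(lam, mu, n_max):
    """Return [q_0, ..., q_n_max] for a stable M/D/1 queue (lam / mu < 1)."""
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("queue is unstable: lam / mu must be below 1")
    q = [1.0 - rho]                                   # empty-queue probability
    if n_max >= 1:
        q.append((1.0 - rho) * (math.exp(rho) - 1.0))  # queue length 1
    for n in range(2, n_max + 1):
        total = math.exp(n * rho)
        for k in range(1, n):
            j = n - k
            bracket = ((k * rho) ** j / math.factorial(j)
                       + (k * rho) ** (j - 1) / math.factorial(j - 1))
            total += math.exp(k * rho) * (-1) ** j * bracket
        q.append((1.0 - rho) * total)
    return q


# Hypothetical rates: energy arrives at half the rate at which it is consumed.
probs = md1_stationary(lam=0.5, mu=1.0, n_max=15)
```

With these sample rates the battery is empty half of the time, and the first sixteen
probabilities already account for almost all of the probability mass.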


3.3 Outage Probability Analysis 


Service outage happens when the user's achievable data rate is less than the requirement
$R_0$. We are interested in analyzing the outage probability constraint for SSUs, MMUs,
MSUs and WWUs respectively, based on which the power demand can be obtained.
The service outage constraint of SSUs can be written as

$$w_{ss,nu}\,\tau_{ss,nu} \ge R_0 \tag{13}$$

where $w_{ss,nu}$ is the bandwidth of an SSU and $\tau_{ss,nu}$ denotes the spectrum efficiency
of cell edge users, given by [13]:

$$\tau_{ss,nu} = \log_2\left(1 + \frac{\alpha_s + 2}{2}\cdot\frac{P_{Ts,n}}{(\theta_s + 1)\sigma^2 W_s D_{s,n}^{\alpha_s}}\right) \tag{14}$$

The physical meaning of Eq. (13) is that the average data rate of the non-cell-edge users
(with spectrum efficiency above $\tau_{ss,nu}$) should be no smaller than $R_0$.

Then, for an MMU,

$$w_{mm,u}\,\tau_{mm,u} \ge R_0 \tag{15}$$

$$\tau_{mm,u} = \log_2\left(1 + \frac{\alpha_m + 2}{2}\cdot\frac{P_{Tm}}{(\theta_m + 1)\sigma^2 W_m D_m'^{\alpha_m}}\right) \tag{16}$$

where $w_{mm,u}$ is the bandwidth of an MMU, and the SBSs are considered to be uniformly
distributed in the macro cell.
In addition, the area served by the MBS excluding the small cells is

$$\pi D_m'^2 = \pi D_m^2 - \sum_{s \in \mathcal{B}} \pi D_s^2 \tag{17}$$


Next, we consider an MSU. The outage probability constraint of the MSUs can be
written as

$$w_{ms,nu}\,\tau_{ms,nu} \ge R_0 \tag{18}$$

where

$$\tau_{ms,nu} = \log_2\left(1 + \frac{\alpha_m + 2}{2}\cdot\frac{P_{Tm}}{(\theta_m + 1)\sigma^2 W_m D_{ms,n}^{\alpha_m}}\right) \tag{19}$$

$w_{ms,nu}$ is the bandwidth of an MSU, and $D_{ms,n}$ denotes the distance between the MBS and
SBSn.
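The outage probability constraint of the WWUs is

$$w_{w,u}\,\tau_{w,u} \ge R_0 \tag{20}$$

where

$$\tau_{w,u} = \log_2\left(1 + \frac{\alpha_w + 2}{2}\cdot\frac{P_{Tw}}{(\theta_w + 1)\sigma^2 W_w D_w^{\alpha_w}}\right) \tag{21}$$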
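
The outage constraints in this subsection all share one pattern: a cell-edge spectrum
efficiency is computed from the tier parameters, and the rate requirement then fixes the
minimum bandwidth per user. The sketch below illustrates that pattern; the function names
and the unit-free sample values are assumptions chosen only to keep the arithmetic
transparent.

```python
import math


def edge_spectral_efficiency(p_t, alpha, theta, sigma2, w_total, radius):
    """Cell-edge spectrum efficiency in the common form of Eqs. (14), (16), (19), (21)."""
    snr = (alpha + 2.0) / 2.0 * p_t / ((theta + 1.0) * sigma2 * w_total * radius ** alpha)
    return math.log2(1.0 + snr)


def min_user_bandwidth(rate_req, tau):
    """Smallest bandwidth w that satisfies w * tau >= rate_req."""
    return rate_req / tau


# Hypothetical values: the scaled SNR equals 3, so tau = log2(4) = 2 bit/s/Hz.
tau = edge_spectral_efficiency(p_t=1.5, alpha=2.0, theta=0.0,
                               sigma2=1.0, w_total=1.0, radius=1.0)
w_min = min_user_bandwidth(rate_req=300e3, tau=tau)  # 300 kbps requirement
```

A larger cell radius or stronger interference lowers tau and therefore raises the
bandwidth each user needs, which is what couples the QoS constraints to the power demand.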


4 Energy Efficiency Maximization for Single-SBS Case 


In this section, we optimize the traffic offloading for the single small cell case, where 
the HSBS and RSBS are analyzed respectively. 


4.1 Single-HSBS Case


For the single-HSBS case, the network energy efficiency is

$$EE = \frac{TH}{P_{sum}} \tag{22}$$

where TH is the network throughput [17], defined as the total amount of data that is
successfully received by the users per unit time and area and measured in bits/s/m²,
and $P_{sum}$ is the total on-grid power consumption.























$$r_{mm} = \rho_m \pi D_m'^2 \tau_{mm} \tag{24}$$

$$r_{ss} = \psi \rho_n \pi D_{s,n}^2 \tau_{ss} \tag{25}$$

$$r_{ms} = \varphi \rho_n \pi D_{s,n}^2 \tau_{ms} \tag{26}$$

$$r_{w} = (1 - \psi - \varphi) \rho_n \pi D_{s,n}^2 \tau_{w} \tag{27}$$

where $I_H$ is a 0-1 indicator denoting whether SBSn is active or not, $r_{mm}$ is the total
data rate of MMUs, $r_{ss}$ is the total data rate of SSUs, $r_{ms}^{a}$ is the total data rate of
MSUs when the SBS is active, $r_{ms}^{o}$ is the total data rate of MSUs when the SBS is not
active, $r_{w}^{a}$ is the total data rate of WWUs when the SBS is active, and $r_{mm}^{o}$ is the
total data rate of MMUs when the SBS is not active. $\psi$ is the traffic offloading ratio
of the SBS, and $\varphi$ is the traffic offloading ratio of the MBS.


$$P_{sum} = P_{MBS} + P_{HSBS} + P_{wifi} \tag{28}$$

where $P_{MBS}$, $P_{HSBS}$ and $P_{wifi}$ denote the on-grid power consumption of the MBS, the
HSBS and the wifi access points, respectively. $P_{MBS}$, $P_{HSBS}$ and $P_{wifi}$ can be derived
based on Eq. (8).


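While $w_{mm}$ is constrained by Eq. (15), $w_{ms}^{a}$ and $w_{ms}^{o}$ are the corresponding
bandwidths needed by the MSUs when the SBS is active and not active, respectively.
Then we have

$$P_{MBS} = P_{Cm} + \frac{\beta_m P_{Tm}}{W_m}\left(w_{mm} + I_H w_{ms}^{a} + (1 - I_H) w_{ms}^{o}\right) \tag{29}$$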
In addition, as the HSBS consumes on-grid power only when the battery is empty, we
have

$$P_{HSBS} = \begin{cases} I_H\, q_0 \left(P_{Cs} + \dfrac{\beta_s P_{Ts}}{W_s} w_{ss}\right), & \lambda_E < \mu_E \\ 0, & \lambda_E \ge \mu_E \end{cases} \tag{30}$$

where $q_0$ is the probability of an empty energy queue, obtained from Eq. (10).
Then

$$P_{wifi} = \frac{\beta_w P_{Tw}}{W_w} w_{ww} \tag{31}$$
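
The on-grid power of the HSBS can be sketched as follows: the SBS draws grid power only
while its energy queue is empty, so its active-mode power from Eq. (8) is weighted by the
empty-queue probability of Eq. (10). The function name and the sample values are
hypothetical.

```python
def hsbs_on_grid_power(active, p_c, beta, p_t, w_ss, w_total, lam, mu):
    """Average on-grid power of an HSBS.

    The HSBS uses grid power only when its battery is empty, which happens
    with probability q0 = 1 - lam / mu when lam < mu and never otherwise.
    """
    if not active or lam >= mu:
        return 0.0
    q0 = 1.0 - lam / mu
    return q0 * (p_c + (w_ss / w_total) * beta * p_t)


# Hypothetical values: energy arrives at half the consumption rate.
p_grid = hsbs_on_grid_power(True, p_c=56.0, beta=2.6, p_t=6.3,
                            w_ss=10e6, w_total=20e6, lam=0.5, mu=1.0)
```

When the harvested energy is sufficient (lam at least mu), the function returns zero,
matching the second branch of the piecewise definition.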


Then, the network energy efficiency maximization problem of the single-HSBS case
can be formulated as follows:

P1:

$$\max\ EE \tag{32}$$

s.t.

$$r_{mm,u} \ge R_0,\quad r_{ss,nu} \ge R_0,\quad r_{ms,nu} \ge R_0,\quad r_{w,nu} \ge R_0 \tag{33}$$

$$0 \le w_{mm} + I_H w_{ms}^{a} + (1 - I_H) w_{ms}^{o} \le W_m \tag{34}$$

$$0 \le w_{ss} \le W_s \tag{35}$$

$$0 \le w_{ww} \le W_w \tag{36}$$

where Eq. (33) guarantees the QoS, and Eqs. (34), (35) and (36) are due to the bandwidth
limitations of the MBS, the HSBS and the wifi access points, respectively. $\mu_E$ can be
derived based on the power consumption model of the HSBS, Eq. (9).

In practice, the HSBS usually provides higher energy efficiency than the MBS, due to the
shorter transmission distance and lower path loss. As a result, more subcarriers should
be utilized to offload more users if the HSBS is active. Besides, $w_{mm} + w_{ms}^{o} > W_m$
happens when the MBS is overloaded, in which case the HSBS should be activated to
relieve the burden of the MBS. However, the SBS can only serve the offloaded traffic as
long as $w_{ss} \le W_s$.


4.2 Single-RSBS Case 


Unlike the HSBS, the RSBS does not consume on-grid power. However, the SSUs have to be
served by the MBS when the battery is empty, which causes handover and on-grid power
consumption. The average power consumption is given by

$$P_{sum} = P_{MBS} + P_{HO} + P_{wifi} \tag{37}$$

where $P_{MBS}$ is the power consumption of the MBS, and $P_{HO}$ reflects the additional power
consumed by SSU handover.

Denote by $I_R \in \{0, 1\}$ the ON-OFF state of the RSBS. If RSBSn is active, handover
happens in the following cases: (1) RSBSn is shut down when the battery runs out; (2)
RSBSn is reactivated when new energy arrives. According to the energy queueing model,
the first case corresponds to the event when $L^*$ transits from 1 to 0, with frequency
$q_1 A_{10} \mu_E$ after the energy queue becomes stable. Due to the duality between the two
cases, the additional handover power consumption is given by

$$P_{HO} = \begin{cases} I_R \cdot 2 q_1 A_{10} \mu_E C_{HO} = 2 I_R \left(1 - \dfrac{\lambda_E}{\mu_E}\right)\left(e^{\lambda_E/\mu_E} - 1\right) A_{10}\, \mu_E C_{HO}, & \lambda_E < \mu_E \\ 0, & \lambda_E \ge \mu_E \end{cases} \tag{38}$$

Note that SBSn may be shut down due to energy shortage even when its state is set
to on, in which case the MBS has to utilize additional bandwidth to serve the SSUs of
SBSn. Based on Eq. (8), the average on-grid power consumption of the MBS is
given as follows:

$$P_{MBS} = P_{Cm} + \frac{\beta_m P_{Tm}}{W_m}\left(w_{mm} + I_R\left((1 - q_0) w_{ms}^{a} + q_0 w_{ms}^{o}\right) + (1 - I_R) w_{ms}^{o}\right) \tag{39}$$

where $w_{mm}$ is constrained by Eq. (15), and $w_{ms}^{a}$ and $w_{ms}^{o}$ denote the bandwidth needed
by the MBS to serve MSUs when the RSBS is active and shut down, respectively.
The average on-grid power consumption of the wifi access point is given as follows:

$$P_{wifi} = \frac{\beta_w P_{Tw}}{W_w} w_{ww} \tag{40}$$


Thus, the energy efficiency maximization problem can be formulated as follows.

P2:

$$\max\ EE \tag{41}$$

s.t.

$$r_{mm,u} \ge R_0,\quad r_{ss,nu} \ge R_0,\quad r_{ms,nu} \ge R_0,\quad r_{w,nu} \ge R_0 \tag{42}$$

$$0 \le w_{mm} + w_{ms} \le W_m \tag{43}$$

$$0 \le w_{ss} \le W_s \tag{44}$$

$$0 \le w_{ww} \le W_w \tag{45}$$

where the objective function is the network energy efficiency, Eq. (42) guarantees the
QoS, and Eqs. (43), (44) and (45) account for the bandwidth limitations of the MBS, the
RSBS and the wifi access points, respectively.


5 Simulation Results


In this section, we evaluate the energy efficiency of the optimal solution for the
single-SBS case. The SBSs are micro BSs, and the main simulation parameters can be found
in Tables 1 and 2. Solar power harvesting devices are installed at the RSBS and HSBS.


Table 1. Power model parameters 


Transmit power $P_T$ (W) | Constant power $P_C$ (W) | Coefficient $\beta$





Table 2. Simulation parameters 


Parameter Parameter | Value 


100 m 




3 1000 m 



a, 5 300 kbps 
o? n 0.05 

o, 9n 1000 

0, A 20 MHz 
w, w, | MHz 


5.1 Network Energy Efficiency of Single-SBS with Different User Density 


Figure 2 shows the network energy efficiency without offloading and with different
offloading ratios, for different user densities under the given system parameters.

The simulation shows that as the amount of traffic offloaded to wifi access points
increases, the energy efficiency decreases, regardless of the energy arrival amount.
The reason is that SBSs are powered by both conventional and green energy, whereas wifi
access points are powered only by conventional energy, so their ratio of output to power
consumption decreases quickly. The wifi case is therefore excluded from the following
discussion.

The red and black lines without markers show the no-offloading case. It can be seen
that energy efficiency generally increases with the offloading ratio and with the number
of users, and that it is higher with offloading than without. Besides, in practice,
$w_{ss} > W_s$ happens when more users are offloaded to the SBS, in which case not all users
can be offloaded to the SBS while satisfying the QoS of every user, so we can choose to
offload the users that maximize the energy efficiency. For the RSBS, the variation trend
of the energy efficiency differs with the user density. Figure 2(b) shows that when the
user density is large enough, the energy efficiency is maximized before the offloading
ratio reaches 1.

Fig. 2. The energy efficiency with different offloading ratio and different user density of single
SBS. (Color figure online)


5.2 Network Energy Efficiency of Single-SBS with Different Energy Arrival Rate 


Figure 3 shows the network energy efficiency without offloading and with different
offloading ratios, for different user densities and energy arrival rates under the same
system parameters. Figure 3(a) shows the energy efficiency for the HSBS, which increases
with the energy arrival rate in the small cell and with the user density.
Figure 3(b) shows the energy efficiency for the RSBS. It can be seen that the variation
trend of the energy efficiency changes with the energy arrival rate: when the energy
arrival rate is small enough, the energy efficiency is maximized before the offloading
ratio reaches 1.

Fig. 3. The energy efficiency with different offloading ratio and different energy arrival rate of
single SBS.


6 Conclusions and Future Work 


In this paper, we have investigated the network energy efficiency achieved through
traffic offloading in green heterogeneous networks. The analytical results show the
maximized energy efficiency under different user densities, and we also analyze the
offloading case under different energy arrival rates, so that we can choose how much
traffic should be offloaded to the SBS and the wifi access points for a given energy
arrival rate. However, we do not consider the case in which all SBSs coexist, where it
must be decided which SBSs should be activated and how much traffic should be offloaded
to each. We will address this case, with its additional constraints, in future work.


Abstract. Cellular heterogeneous networks comprising small-cells coexisting
with macro-cells have emerged as a promising solution to improve the in-building
coverage and capacity of wireless networks. In a cognitive radio enabled cellular
heterogeneous network (CCHN), where the small base stations (SBSs) carry out
spectrum sensing, severe cross-tier interference is avoidable; however, the joint
design of spectrum sensing and access poses new technical challenges. Furthermore,
the joint optimization encompasses a large number of heterogeneous nodes and
connections, which makes a centralized paradigm unwieldy because of the enormous
signaling and computation involved. In this paper, we propose a decentralized
approach by formulating the non-cooperative power allocation game (NPAG), wherein
each SBS competes to maximize its own opportunistic throughput through the joint
choice of the sensing factors and the resource allocation strategy, subject to
interference and energy constraints. Further, the iterative water-filling (IWF)
algorithm is utilized to deal with the non-convexity of the game. Simulation results
show that the proposed game theoretical formulation yields a considerable performance
improvement for the joint optimization problem.


Keywords: CCHN · Cognitive radio · Joint optimization · Game theory


1 Introduction 


Due to their ability to bring about massive spatial reuse of frequency, low-power,
short-range mini base stations (BSs), referred to as small-cells (femto-cells,
pico-cells), are increasingly important for improving network capacity. These
small-cells coexist with conventional macro-cell networks, giving rise to cellular
heterogeneous networks (CHNs). The structure of a CHN causes interference problems in
each layer. The inter-tier (inner-tier) interference (ITI) from the macro BSs
(small-cell BSs) to the small-cell users can be a critical challenge to the CHN. The
resulting lower spectral efficiency (SE) and higher interference often lead to a great
loss in capacity and coverage. It is therefore necessary to improve SE and reduce
interference.

Small-cells provide flexibility in the network topology, while cognitive radio (CR)
enables the agile usage of the spectrum resource and helps small-cells avoid major
interference sources. The CHN benefits from combining small-cell and cognitive radio
technologies, which yields the CR enabled Cellular Heterogeneous Network (CCHN).
However, challenges arise in how to implement CR in CCHNs. (1) The duration of the
spectrum sensing in one slot poses a dilemma. A longer sensing time gives higher
sensing accuracy (SA), but it also leaves less time for the data-transmission period.
(2) The sensing threshold affects SA and SE inversely, i.e., a higher sensing threshold
means more accurate sensing results but lower spectral efficiency, since access
opportunities are likely to be missed. Hence, the design of the sensing factors should
consider the tradeoff between SA and SE, so that the access availability is optimized.
The resource allocation strategy for reducing interference is another important issue
that must be addressed to increase SE. In summary, the mutual influence between sensing
and access has motivated us to investigate the joint optimization of spectrum-sensing
configuration and resource allocation in CCHNs.

The idea of considering sensing and resource allocation jointly has been investigated
in conventional cognitive radio networks (CRNs). Several previous works [1, 2]
have already considered the tradeoff between sensing capabilities and the throughput of
secondary users. Some studies solve the joint optimization problem using convex
optimization theory, such as [1-3], and some formulate the joint problem using game
theory, such as [4, 5]. Other papers adopt heuristic algorithms, for example the greedy
algorithm [6] and the evolutionary algorithm [7], to solve this kind of problem.

There is also plenty of literature on the design of spectrum sensing methods or access
strategies in CCHNs. The authors of [8] present a cyclostationary-based spectrum
sensing method, whose design is based on the study of the detection probability (DP)
and ignores the joint consideration of resource allocation. In [9], the authors provide
a spectrum sharing algorithm that relies on existing sensing results and is based on
the ALOHA and DSMA models. However, the issue of joint optimization in CCHNs has barely
been addressed, owing to the large number of heterogeneous nodes and connections, the
messy interference environment, etc.

Based on the aforementioned related work, in this paper we propose a decentralized
approach by formulating the non-cooperative power allocation game (NPAG),
wherein each SBS competes against the others by jointly choosing the sensing factors
and the resource allocation strategy. Then, a pricing mechanism with constraints on
sensing accuracy and power consumption is introduced to reduce the interference
caused by sensing errors. To deal with the non-convexity, we compute the optimal
allocation strategy under a given sensing time using the iterative water-filling
(IWF) algorithm, and then select the optimal solution from multiple groups of sensing
factors. Finally, the performance of the proposed algorithm is analyzed by comparison
with the average power allocation (APA) algorithm in a system level simulation.


2 System Model


2.1 Network Scenario 


We consider a CCHN consisting of one MBS and multiple cognitive FBSs, where
femto-cells coexist with the macro-cell in an overlay approach. Each FBS operates in the
closed access mode, where only authorized UEs are allowed to access it. The sets of
FBSs and MUEs are denoted by $S^F = \{i, i = 1, 2, \ldots, N_F\}$ and
$S^M = \{j, j = 1, 2, \ldots, N_M\}$. The femto-cells share the same spectrum, and each cell
adopts OFDMA as its spectrum allocation strategy. Specifically, the total spectrum band
is divided into $N_C$ orthogonal sub-channels, denoted by $S^C = \{n, n = 1, 2, \ldots, N_C\}$. We
focus on block transmissions over channels which are assumed to be perfectly
synchronized. With a fixed length T, the transmission frame of FBS i is divided into two
slots: a sensing slot $\tau_i$ and a data slot, as shown in Fig. 1. During the sensing
interval, FBSs sense the environment, looking for spectrum holes, whereas during the data
slot they possibly transmit over the licensed spectrum detected as available. The sensing
and transmission phases are described in "The Spectrum Sensing Problem" and
"The Transmission Phase" below, respectively.


[Figure: two consecutive frames, time slot k and time slot k+1; each frame of length T
consists of a sensing time $\tau_i$ followed by an access time $T - \tau_i$]


Fig. 1. Frame structure of the cognitive FBSs 


The channel model follows Rayleigh fading and large-scale path loss. Additive
complex white Gaussian noise is also assumed (with zero mean and variance $\sigma^2$).
The channel fading between FBS i and its authorized UEs on the n-th subcarrier is
denoted as $|g_{i,n}|^2$. Meanwhile, $|g_{M,j,n}|^2$ represents the channel fading between
the MBS and MUE j on the n-th subcarrier, and $|g_{i,j,n}|^2$ stands for that between
FBS i and MUE j. Besides, the channel gain between the MBS and the UEs in FBS i is
denoted as $|g_{M,i,n}|^2$. In CCHNs, the interference brought by mistaken sensing
between BSs in different tiers is not considered when restricting the sensing accuracy.
Therefore, we only analyze the inner-tier interference, which is depicted in Fig. 2.


The Spectrum Sensing Problem 
Assume FBSs sense the licensed channels using energy detection in a non-cooperative
way. In the sensing phase, we define the events of MBS occupation as follows:


$H_1$: The MBS occupies the channel.
$H_0$: The MBS is idle on the channel.

Fig. 2. The CCHN with inner-tier/cross-tier interference


The performance of the energy detector scheme is measured in terms of the DP
$P^d_{i,n}(\tau_i, \varepsilon_{i,n})$ and the false-alarm probability (FP) $P^f_{i,n}(\tau_i, \varepsilon_{i,n})$. DP and
FP are the probabilities of detecting that the channel is busy under $H_1$ and $H_0$,
respectively. Here, employing the results of [1], we write $P^d_{i,n}(\tau_i, \varepsilon_{i,n})$ and
$P^f_{i,n}(\tau_i, \varepsilon_{i,n})$ as in (1)-(2) respectively, where $\tau_i$ denotes the sensing time of
FBS i, $\varepsilon_{i,n}$ is the decision threshold of FBS i on sub-channel n, and $\mu$ is the
sampling rate. Besides, $G_n = (|g_{1,n}|, |g_{2,n}|, \ldots, |g_{N_M,n}|)^T$ (superscript T means
the transpose operation). I denotes an identity matrix, and $\mathrm{Diag}(G_n)$ is a diagonal
matrix formed by the elements of $G_n$. $Q(\cdot)$ is the Q-function, and $\|\cdot\|_1$ of a matrix
denotes its entry-wise 1-norm, which is the sum of the absolute values of all its
elements.

. — [o ; 
o\/2\|07I +2Diag(G,) || Nc 
a ET = o( en Nv (2) 


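Deriving from (1), $P^f_{i,n}(\tau_i, \varepsilon_{i,n})$ can be further expressed with respect to
$P^d_{i,n}(\tau_i, \varepsilon_{i,n})$ as (3), where $\Sigma_n = \sigma^2 I + 2\,\mathrm{Diag}(G_n)$.

$$P^f_{i,n} = Q\left(\frac{Q^{-1}\left(P^d_{i,n}(\tau_i, \varepsilon_{i,n})\right)\sigma\sqrt{2\|\Sigma_n\|_1 N_C} + \|G_n\|_1\sqrt{\mu\tau_i}}{\sigma^2\sqrt{2 N_F N_C}}\right) \tag{3}$$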


In order to secure the performance of MUEs, it is rational to set system thresholds
for the minimal DP and the maximal FP. Note that the FBS can use channel n if and only
if the received power from the MBS on this channel is less than the spectrum sensing
threshold $\varepsilon_{i,n}$. Thereby, $\varepsilon_{i,n}$ should be carefully chosen to improve the access
opportunity while making sure that the MUEs do not suffer severe interference.

$$P^d_{i,n}(\tau_i, \varepsilon_{i,n}) \ge P^{th}_d, \quad \forall n \in S^C,\ \forall i \in S^F \tag{4}$$

$$P^f_{i,n}(\tau_i, \varepsilon_{i,n}) \le P^{th}_f, \quad \forall n \in S^C,\ \forall i \in S^F \tag{5}$$

The Transmission Phase 
Cognitive FBSs coexist in the network operating in a non-cooperative manner, and no
centralized authority is assumed to handle the network access for the FBSs. The
transmission strategy of each FBS i is then the power allocation vector
$p_i = \{p_{i,n}\}_{n=1}^{N_C}$ and the channel allocation vector $\beta_i = \{\beta_{i,n}\}_{n=1}^{N_C}$ over the
$N_C$ subcarriers ($p_{i,n}$ is the power allocated over channel n, and $\beta_{i,n}$ is the
sub-channel allocation indicator, which equals 1 when FBS i is allocated sub-channel n
and 0 otherwise).


Successful detection: The goal of each FBS is to optimize its opportunistic spectral 
utilization of the licensed spectrum. According to this paradigm, each sub-channel n is 
available for the transmission of FBS 7 if no MBS’s signal is detected over that 
frequency band, which happens with probability 1 — P „(Ti, Gin). This motivates the 
use of the aggregate opportunistic throughput as a measure of the spectrum efficiency 


of each FBS i. Given the power allocation profile p= (poi of the FBSs, the detection 


thresholds z; = CANF and sensing time qT;, the aggregate opportunistic throughput of 
FBS 7 is defined as 


Nc 
— Ti 0 E =e 
Ri(ti, €i, Pi, B;) = (1 — 7) Do Pre) (1 — PL (ti, Bin) ) Pin Bins Bin) (6) 


where $1 - \tau_i/T$ ($\tau_i < T$) is the portion of the frame duration available for
opportunistic transmissions, $P^{f}_{i,n}(\tau_i, \varepsilon_{i,n})$ is the FP defined in (2),
and $r_{i,n}(p_{i,n}, \beta_{i,n})$ is the maximum achievable rate of FBS i over channel n
when no MBS signal is detected and the power allocation of the FBSs is given. The
maximum achievable rate is expressed as

$$r_{i,n}(p_{i,n}, \beta_{i,n}) = \beta_{i,n} \log_2\!\left(1 + \frac{p_{i,n}\, g_{ii,n}}{\sigma_{i,n}^2 + \sum_{j \ne i} p_{j,n}\, g_{ji,n}}\right) \qquad (7)$$

where $\sigma_{i,n}^2$ is the variance of the background noise over channel n at the receiver
UE associated with FBS i (assumed to be zero-mean Gaussian).
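
As an illustration only, the rate in (7) can be evaluated with a few lines of Python. The nested-list layout `g[j][i][n]` (gain from the transmitter of FBS j to the receiver served by FBS i on channel n) and the function name are assumptions made for this sketch, not notation from the paper.

```python
import math


def achievable_rate(i, n, p, g, noise_var, beta):
    """Rate of FBS i on channel n in bit/s/Hz, following the form of (7).

    p[j][n]         : transmit power of FBS j on channel n
    g[j][i][n]      : gain from FBS j's transmitter to FBS i's receiver (assumed layout)
    noise_var[i][n] : background noise variance at FBS i's receiver
    beta[i][n]      : 1 if channel n is allocated to FBS i, else 0
    """
    # Interference from all other FBSs transmitting on the same channel.
    interference = sum(p[j][n] * g[j][i][n] for j in range(len(p)) if j != i)
    sinr = p[i][n] * g[i][i][n] / (noise_var[i][n] + interference)
    return beta[i][n] * math.log2(1.0 + sinr)
```

With two FBSs on one channel, unit powers, a direct gain of 1, a cross gain of 0.5 and a noise variance of 0.5, the SINR of FBS 0 is 1 and its rate is 1 bit/s/Hz; an FBS whose indicator is 0 gets a rate of 0.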


As mentioned above, $P^{d}_{i,n}$ can be written as (3). According to [1], (6) achieves its
maximal value when $P^{d}_{i,n}(\tau_i, \varepsilon_{i,n}) = P^{d}_{th}$. Therefore, (6) can be
rewritten as

$$R_i(\tau_i, p_i, \beta_i) = \left(1 - \frac{\tau_i}{T}\right) \sum_{n=1}^{N_C} \Pr(H_0)\,\bigl(1 - P^{f}_{i,n}(\tau_i)\bigr)\, r_{i,n} \qquad (8)$$


Here, 
f — Q`! (Ph)0 2||2n|| Nc T [Gall LT; 

pg eee (9) 
o2.,/2NrNc 


Energy constraints: To control the total energy consumed by the FBSs, we impose
individual energy constraints at the level of each FBS i:

$$\tau_i \sum_{n=1}^{N_C} p^{s}_{i,n} + (T - \tau_i) \sum_{n=1}^{N_C} p_{i,n} \le E_i^{max} \qquad (10)$$

where $p^{s}_{i,n}$ is the sensing power of FBS i on channel n, and $E_i^{max}$ is the maximum
energy that FBS i is allowed to consume.


Interference constraints: To control the interference radiated by the FBSs, we impose
individual interference constraints at the level of each FBS i to limit the average
interference generated at the MBS:


3 pris ate Pin < Sia (11) 


where $I_i^{max}$ is the maximum average interference allowed to be generated by FBS i,
and $P^{miss}_{i,n} = 1 - P^{d}_{i,n}$ represents the probability that FBS i fails to detect
the presence of the MBS on sub-channel n and thus generates interference against the MBS.


3 Joint Spectrum Sensing and Access Optimization Strategy 
via Game Theory 


In Sect. 2, we defined the opportunistic throughput $R_i$ and the individual constraints
(10)-(11). We assume that the FBSs aim to maximize their own throughput
non-cooperatively, subject to these constraints. Therefore, non-cooperative game theory
can be used to build an optimization model if we regard each FBS as a player and $R_i$
as the player's utility. Hence, in this section, we focus on formulating the joint
optimization of spectrum sensing and resource allocation of the FBSs within the
framework of game theory. We next propose an equilibrium problem, a game with
individual constraints, and then provide a unified analysis of the game.

As mentioned above, each FBS is modeled as a player who aims to maximize its own
opportunistic throughput $R_i(\tau_i, p_i, \beta_i)$ by jointly choosing a proper power
allocation strategy $p_i = \{p_{i,n}\}_{n=1}^{N_C}$, carrier allocation strategy
$\beta_i = \{\beta_{i,n}\}_{n=1}^{N_C}$ and sensing time $\tau_i$, subject to individual
constraints (10).

In order to enforce the constraints (10)-(11) while keeping the optimization as
decentralized as possible, we propose a pricing mechanism that adds a penalty to the
utility function of each player. For simplicity, the price of the i-th player is denoted
by $\gamma_i$ and is assumed to be given. Player i's optimization problem is then to
determine, for the given $\gamma_i$, a tuple $(\tau_i, p_i, \beta_i)$ that solves


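$$\underset{\tau_i,\, p_i,\, \beta_i}{\text{maximize}} \quad U_i(\tau_i, p, \beta_i) = R_i(\tau_i, p, \beta_i) - \gamma_i \sum_{n=1}^{N_C} \bigl(\tau_i\, p^{s}_{i,n} + (T - \tau_i)\, p_{i,n}\bigr) \qquad (12)$$

s.t.

$$\tau_i \sum_{n=1}^{N_C} p^{s}_{i,n} + (T - \tau_i) \sum_{n=1}^{N_C} p_{i,n} \le E_i^{max} \qquad (13)$$

$$\tau_{min} \le \tau_i \le T, \quad \forall n \in S^{C} \qquad (14)$$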
No > 
NO pt l emin| Pin < B (15) 
n=1 
Here, (14) is derived from (2), (3), (5). Tmin = max(tl;,, Thin,- TAE), 
no (O°! (PI) 0? VINFNe —O7! (P4,)o, A22 hN) 
“min ~ liG 


For notational simplicity, we will use @; = (t;, p;,B;) to denote the strategy tuple of 
player i. 


4 Solution Analysis 


This section is devoted to the solution analysis of the game (12). We start by
introducing the definition of the non-cooperative power allocation game (NPAG).

According to (12), the optimal sub-channel matrix $\beta$ can be determined once the
network power vector p is determined. In addition, to simplify the joint problem, we
compute the optimal power allocation vector for different sensing times, and then
choose the best tuple $(\tau_i, p_i, \beta_i)$ as the optimal solution. Therefore, the joint
problem of sensing time and resource allocation turns into a power allocation problem,
and the game becomes an NPAG $G = \{S^{F}, \{P_i\}_{i \in S^{F}}, \{U_i\}_{i \in S^{F}}\}$ as follows:

$$\text{NPAG}: \quad \max_{p_i} U_i(\tau_i, p, \beta_i) = \max_{p_i} \left( R_i(\tau_i, p, \beta_i) - \gamma_i \sum_{n=1}^{N_C} \bigl(\tau_i\, p^{s}_{i,n} + (T - \tau_i)\, p_{i,n}\bigr) \right) \qquad (16)$$

s.t.

$$p_i \in P_i \qquad (17)$$


Here, 


N 
P= fpa Sop T — 1) a > ot leaf nn <i | (18) 
n=1 


We call a network power vector p a Nash equilibrium if, for every $i \in S^{F}$,
$U_i(\tau_i, p_i, p_{-i}, \beta_i) \ge U_i(\tau_i, q_i, p_{-i}, \beta_i)$ for all $q_i \in P_i$.
In other words, at a Nash equilibrium, given the transmission power vectors of the other
FBSs, no FBS can improve its utility by unilaterally changing its own transmission power
vector.

For a given sensing time $\tau_i$, the utility $U_i(\tau_i, p, \beta_i)$ is a concave function
of $p_i$, so maximizing it over $P_i$ is a convex problem. Therefore, we can apply the
Karush-Kuhn-Tucker (KKT) conditions to obtain the solution.

The Lagrangian function of FBS i is given as follows:


Nc 
Li(p)=A (e _ («3 St, + n E Ti) rs) Sp) + U;i(p )+ li (im Spt lems fn) (19) 
n=1 


n=l 


Then, in the NPAG, the best response of FBS i is given by 


2 me a 
(1 = P rH) (1 — PI (t) ) Bi T Y Pinl8rin 


(m (7 = ti) — A(T - t) — upt emin] ) In? Bial 
Nc 
afam — ($ 3 + =g) Sona) ) = 0, 24:20, 
n=1 
em St _ 0, u; >0. 


(20)

where the left-hand side of (20) is the allocated power $p_{i,n}$, $\lambda_i$ and $\mu_i$ are
the Lagrangian multipliers for the maximum energy constraint (10) and the interference
constraint (11), and $[x]^{+} = x$ if $x > 0$ and 0 otherwise.

Unfortunately, it is hard to solve for $p_{i,n}$ in closed form because two Lagrangian
multipliers are involved. To get around this difficulty, we consider the iterative
water-filling (IWF) algorithm mentioned in [10]. The best response of the game can be
found by updating the multipliers iteratively until they reach a given accuracy.

The details of the IWF algorithm are listed in Algorithm 1.
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Algorithm 1 IWF Algorithm For Power Allocation 


1 Initialize the initial value of power, number of user-channels in power convergence state, 
iteration number of FBS 7 on channel n, the flag bit of convergence and accuracy 


requirement by p,,=I>"/N., count=0 ,k,,=0  flag,,=0, 6=107 


separately. 
2If count!= N N, 


3 For n=1:N. do 





























d For i=1:N, do 
5 If flag, ,, =1, count + +; continue; 
6 Else 
7 P; E E P 
oL oL 

8 k,, ++; step _ A = — (p; ); step _ u = — (P; ); 

T p 54 P) pH ay, (Pi) 
9 A, =A, +stepxstep _ A; 4, = U; +stepXstep_Ms Pin = Pin; 
10 

T: 2 ai 
0- PH) (1-24) P? +9 Pjn|8 jan 
i j+i 
Pin miss 2 7 i 
(ZA T-7)-4 (T-7;)-1 PM Borin] )m2 Bis 

11 End If 
12 If (p =0) &&abs TE =P )/ Pin < e) kia Z Kit 
13 flag, , =i: 
14 Else continue 
15 End If 
16 End For 
17 End For 


18 End If

The sub-channel allocation strategy $\beta_{i,n}$ is obtained from the corresponding value
of $p_{i,n}$: $\beta_{i,n} = 1$ when $p_{i,n} > 0$, and $\beta_{i,n} = 0$ otherwise. Furthermore,
the optimal strategy tuple of player i is obtained by comparing the sets of $(p, \beta)$
computed for the given values of $\tau_i$.
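
To make the water-filling idea concrete, the following Python sketch implements a deliberately simplified variant: classical iterative water-filling in which each FBS has a single total-power budget and the water level is found by bisection. This is not Algorithm 1 itself, which updates two multipliers (energy and interference) and includes the price term; the function names, the `g[j][i][n]` gain layout and the single-budget constraint are assumptions made for illustration.

```python
def water_fill(inv_gain, budget, tol=1e-9):
    """Single-user water-filling: p_n = max(0, level - inv_gain[n]), sum(p) = budget."""
    lo, hi = 0.0, max(inv_gain) + budget  # the water level lies in [lo, hi]
    while hi - lo > tol:
        level = 0.5 * (lo + hi)
        used = sum(max(0.0, level - x) for x in inv_gain)
        if used > budget:
            hi = level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    return [max(0.0, level - x) for x in inv_gain]


def iterative_water_filling(g, noise, budget, max_iter=60, tol=1e-6):
    """Each FBS in turn water-fills against the interference created by the others.

    g[j][i][n] : gain from transmitter j to receiver i on channel n (assumed layout)
    noise[i][n]: noise variance at receiver i on channel n
    budget[i]  : total power budget of FBS i (simplifying assumption)
    """
    users, channels = len(g), len(noise[0])
    # Start from an equal split of the budget, as in an average power allocation.
    p = [[budget[i] / channels] * channels for i in range(users)]
    for _ in range(max_iter):
        biggest_change = 0.0
        for i in range(users):
            # Effective noise-plus-interference normalized by the direct gain.
            inv_gain = [
                (noise[i][n]
                 + sum(p[j][n] * g[j][i][n] for j in range(users) if j != i))
                / g[i][i][n]
                for n in range(channels)
            ]
            new_p = water_fill(inv_gain, budget[i])
            biggest_change = max(
                biggest_change, max(abs(a - b) for a, b in zip(new_p, p[i]))
            )
            p[i] = new_p
        if biggest_change < tol:  # all powers have stopped moving
            break
    return p
```

The stopping rule mirrors the accuracy check in Algorithm 1: iteration ends once no power changes by more than the tolerance. Setting the indicator to 1 on every channel that ends up with positive power then gives the sub-channel allocation described above.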


5 Simulation Results and Analysis 


In this section, simulation results are presented to illustrate the theoretical findings.
After testing the convergence of the proposed power allocation algorithm, performance
results with respect to throughput, EE and utility function are presented, with the
average power allocation (APA) algorithm as a baseline. The system is set up according
to the simulation parameters in Table 1.


Table 1. Simulation parameters

Parameter                                   Value
Number of MBSs $N_M$                        1
Number of FBSs $N_F$                        3
Number of channels $N_C$                    5
Interference threshold $I_i^{max}$          0.1 W
Maximal iteration number $k_{max}$          60
Accuracy requirement δ                      10°
Sampling rate u                             8 MHz
Frame length T                              20 ms
DP threshold $P^{d}_{th}$                   0.9
FP threshold $P^{f}_{th}$                   0.2
Energy threshold $E_i^{max}$                2 mJ
Sensing power $p^{s}_{i,n}$                 1 mW





Fig. 3. Convergence of the proposed power allocation algorithm (transmission power in W
versus iteration number)

Figure 3 presents the convergence of the proposed power allocation algorithm, where n
and i denote the indices of the channel and the FBS, respectively. The power of each FBS
on each channel declines gradually and then levels off, at which point the system reaches
a stable state. In the proposed algorithm, the power converges within 10 iterations to
the accuracy δ.


Fig. 4. The comparison of throughput (bit/s/Hz) versus sensing time (ms)

Fig. 5. The comparison of utility function U versus sensing time (ms)


Figs. 4, 5 and 6 compare the two schemes to show the effect of sensing time on the
femto-cells' throughput, utility function and energy efficiency (EE). As depicted in
Fig. 4, for the proposed IWF algorithm the throughput first increases with $\tau_i$ and
then decreases to zero. This is because a longer $\tau_i$ means a lower FP, which improves
the availability of spectrum resources and yields more throughput by letting more FBSs
access the idle channels. However, once the sensing time reaches a certain threshold,
further sensing has little effect on the FP, while the shorter transmission time reduces
the transmission efficiency and degrades the system performance. The same trend can be
found in the utility function, as illustrated in Fig. 5. The price factor $\lambda_i$
regulates itself according to the total power, aiming to limit the power of each FBS.
Compared with the throughput, the utility function takes a lower value because of the
pricing mechanism.

In Fig. 6, the EE curve follows the same trend as in Fig. 5. Furthermore, the IWF
algorithm outperforms the APA algorithm because, while guaranteeing throughput, it
strongly reduces the total allocated power.

Figure 7 shows that throughput increases with the inner SNR, since a higher SNR brings
a higher power allocation, which improves the throughput of the femto-cells. However,
as the SNR increases, the inner-tier interference also becomes larger, which limits the
improvement. Eventually, the throughput no longer varies with the inner SNR.

Fig. 6. The comparison of EE versus sensing time

Fig. 7. The comparison of throughput versus inner SNR


6 Conclusion 


In this paper, the problem of optimal multichannel sensing (with respect to sensing time
and threshold) and resource allocation (with respect to channel and power) in CCHN is
explored by formulating a game in which the FBSs act as players, each maximizing its own
opportunistic throughput by jointly choosing the sensing and access configurations.
A pricing mechanism that penalizes the FBSs for violating the power constraints is also
provided. Finally, we deal with the non-convexity of the game by utilizing the iterative
water-filling (IWF) algorithm. Simulation results show that the proposed IWF algorithm
achieves better performance in both femto-cell throughput and EE than the traditional
APA algorithm.

Abstract. The increasing traffic demands of the past few years require further
expansion of cellular network capacity. Densely deployed base stations may cause
a large amount of energy consumption and many environmental problems.
Relaying is considered a promising cost-efficient solution for capacity expansion
and energy saving due to the physical characteristics and low power requirements
of the relay nodes. In this paper, we study the energy harvesting relay node
deployment problem in a cellular network where the user distribution may change
with time. We propose a greedy relay node deployment algorithm to satisfy the
network capacity requirements at different times and put forward an operation
optimization algorithm to make full use of renewable energy. Simulation results
show that the network capacity can be increased significantly and the energy
consumption can be reduced dramatically by the proposed algorithms.


Keywords: Heterogeneous cellular network · Capacity expansion · Energy harvesting ·
Relay node deployment


1 Introduction 


In recent years, the rapid rise in the popularity of smart devices has required fast and
ubiquitous wireless networks, which makes it necessary to expand the original network
capacity. However, increasing the number of macro base stations (BSs) in the cellular
network significantly reduces the gain because of the elevated interference.
Heterogeneous networks (HetNets) provide services across both macro and small cells to
optimize the mix of capabilities and have attracted much attention in recent years [1, 2].
Relay networks, as one kind of HetNet, are a promising solution for performance
enhancement and energy saving. Relaying is one of the technologies proposed in
LTE-Advanced networks for coverage expansion and capacity enhancement [3]. Relay nodes
(RNs) are connected to the core network with wireless backhaul through an evolved Node B
(eNB), and they usually consume much less energy than BSs.

Over the years, cellular networks have become an important source of energy
consumption. Reducing the network energy consumption has both economic and
environmental benefits, which encourages people to look for alternative power supplies.
The use of renewable energy sources such as wind power and photovoltaic energy can
significantly reduce traditional energy consumption. Many kinds of renewable energy
sources have been studied for green networks, as introduced in [4].

Improvements in network performance have been widely studied. In [5], micro base
stations are deployed to increase the area spectral efficiency (ASE). The results show
that the ASE can be improved significantly when the micro base stations are located at
the edges of macro cells, where the received signal is worst. The energy efficiency
problem of cellular networks is studied in [6]: excessively increasing the number of
micro base stations may reduce the energy efficiency, and there exists an optimum number
of deployed BSs. The capital expenditure and the total operational expenditure are
considered as the base station costs in [7]. A hierarchical power management
architecture is introduced in [8], where the authors focus on effectively exploiting the
green capabilities of networks. A green base station offloading model is proposed in
[9]: by predicting the amount of green energy collected and updating the residual energy
of each BS, the maximum number of users that each BS can offload can be calculated to
achieve different network performance. The article [10] mainly focuses on the energy
supply aspects of the network and investigates how to optimize the use of on-grid energy
to reduce the carbon emission rate. A green base station optimization algorithm is
proposed in [11] to reduce the traditional power consumption, in which the transmission
power of BSs can be adjusted to improve network performance. The deployment of RNs has
been used in recent years to improve network performance, such as connectivity [12] and
lifetime maximization [13]. The article [14] considered finite-state Markov channels in
the relay-selection problem to increase spectral efficiency, mitigate error propagation,
and maximize the network lifetime. These studies usually focus on the user distribution
in peak periods to ensure the network performance in the worst case, while ignoring the
differences in traffic demands across positions and time periods, which may waste
considerable resources when the network is lightly loaded.

In this paper, a more practical condition is considered. We deploy RNs to improve
the network capacity where one BS has already been deployed, and the changes of user
distribution over time are fully considered. In addition, the RNs are equipped with
energy harvesting devices so that they can harvest ambient energy. RNs still need to be
connected to the on-grid power network in case the renewable energy sources are
inadequate.

The remainder of the article is organized as follows. In Sect. 2, the system model is 
introduced. In Sect. 3, the problem, constraints and objective function are defined. 
Section 4 describes the proposed algorithms for these problems. Section 5 gives the 
simulation parameters and results. Section 6 concludes the paper. 


2 System Model 


Consider a wireless network in which all users are served by one BS b. The capacity of
this network needs to be expanded to satisfy the growing traffic demand by deploying
RNs. The deployed RN set is denoted by R. The users, denoted by U, always connect
to the BS b or the RN r that provides the highest signal strength to them. The
signal-to-interference-plus-noise ratio (SINR) of user u can be written as

$$\Gamma_u(b) = \frac{P_t(b)\, g_{b,u}}{\sum_{r \in R} P_t(r)\, g_{r,u} + \sigma^2} \qquad (1)$$

$$\Gamma_u(r) = \frac{P_t(r)\, g_{r,u}}{P_t(b)\, g_{b,u} + \sum_{r' \in R,\, r' \ne r} P_t(r')\, g_{r',u} + \sigma^2} \qquad (2)$$

$\Gamma_u(b)$ and $\Gamma_u(r)$ denote the SINR of users connected to the BS b and the RN r,
respectively. $P_t(b)$ and $P_t(r)$ denote the transmission power of BS b and RN r,
respectively. $g_{b,u}$ is the channel gain from BS b to user u and $g_{r,u}$ is the channel
gain from RN r to user u. Thermal noise is denoted by $\sigma^2$. For simplicity, shadow
fading and small-scale fading are ignored, and only the path loss is considered. The
capacity of user u can then be calculated from the SINR as


$$C_u = B_u \log_2(1 + \Gamma_u) \qquad (3)$$

$B_u$ is the bandwidth assigned to user u under equal bandwidth scheduling, which allows
BSs to share the resources equally among users [15].
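
The SINR and capacity expressions above translate directly into code. The sketch below is illustrative only: the function names and the flat-list representation of relay powers and gains are assumptions, and received power is taken as transmission power times channel gain, as in the text.

```python
import math


def sinr_from_bs(p_bs, g_bs, relay_powers, relay_gains, noise):
    """SINR of a user served by the BS; every deployed RN acts as an interferer."""
    interference = sum(p * g for p, g in zip(relay_powers, relay_gains))
    return p_bs * g_bs / (interference + noise)


def sinr_from_rn(k, p_bs, g_bs, relay_powers, relay_gains, noise):
    """SINR of a user served by RN k; the BS and all other RNs interfere."""
    interference = p_bs * g_bs + sum(
        p * g
        for j, (p, g) in enumerate(zip(relay_powers, relay_gains))
        if j != k
    )
    return relay_powers[k] * relay_gains[k] / (interference + noise)


def capacity(bandwidth, sinr):
    """Shannon capacity of one user for the bandwidth share it is assigned."""
    return bandwidth * math.log2(1.0 + sinr)
```

For example, a BS signal of 4 x 0.5 = 2 against one interfering RN contributing 1 and a noise power of 1 gives an SINR of 1, so a user with a bandwidth share of 10 obtains a capacity of 10 in the same units.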

The power consumption of a BS consists of two parts. The first part is the static
power consumed with no transmission, and the other part depends on the transmission
power [16]. It is given by

$$P(b) = P_s(b) + a_b P_t(b) \qquad (4)$$

$P_s(b)$ is the static power of each BS and $a_b$ scales the transmission power depending
on the load.

The power consumption model of each RN is similar. However, unlike previous studies, we
consider the energy harvesting ability of the RNs. With the help of energy harvesting
devices, the RNs can produce energy from renewable sources such as sunlight or wind.
However, renewable energy sources may be limited and the generated energy may not be
enough to fully power the RNs [17]. So we define $\rho = P_{har} / P(r)_{max}$ as the energy
harvesting rate, where $P_{har}$ and $P(r)_{max}$ denote the average harvesting power and the
maximum consumption power of each RN, respectively. The RNs consume less on-grid power
as long as $\rho > 0$ and can work for a whole day without consuming any on-grid power
when $\rho \ge 1$. Thus the consumption power of each RN can be written as

$$P(r) = \begin{cases} 0, & \rho \ge 1 \\ P_s(r) + a_r P_t(r) - P_{har}, & \rho < 1 \end{cases} \qquad (5)$$

$P_s(r)$ is the static power of each RN, $a_r$ scales the transmission power depending on
the load, and the average harvesting power is denoted by $P_{har}$.

It is assumed that users connected to the network always have data to transmit (full
buffer) and that no power control algorithm is used. Therefore, BSs and RNs are fully
utilized, and $a_b$ and $a_r$ are constant for all BSs and RNs.
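
A small Python sketch of this power model follows. It assumes that, when the harvesting rate is below 1, the on-grid draw is simply the shortfall between the maximum consumption power and the average harvested power; the function and argument names are hypothetical.

```python
def rn_on_grid_power(static_power, load_factor, tx_power, harvest_power):
    """On-grid power of one RN in watts under the harvesting-rate model.

    Assumption: the grid only supplies the shortfall between the RN's
    maximum consumption power and its average harvested power.
    """
    max_power = static_power + load_factor * tx_power
    rho = harvest_power / max_power  # energy harvesting rate
    if rho >= 1.0:
        return 0.0  # harvested energy covers the whole demand
    return max_power - harvest_power
```

With a static power of 20 W, a transmission power of 1 W and a load factor of 1 (an assumed value), an RN that harvests 10.5 W on average draws 10.5 W from the grid, while one that harvests 21 W or more draws nothing.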


3 Problem Definition 


Deploying RNs can raise the network capacity, but it also brings extra energy
consumption. By deploying RNs efficiently, we can minimize the extra energy consumption
while meeting the capacity requirement. The problem can be formulated as

$$\min \sum_{r \in R} P(r) \quad \text{s.t.} \quad \sum_{u \in U} C_u(B \cup R) \ge \lambda \cdot C_0 \qquad (6)$$

$C_u(B \cup R)$ denotes the capacity of user u when both BSs and RNs are deployed. $C_0$
is the original network capacity without deploying any RNs, and the multiplier
$\lambda > 1$ is the desired capacity increase over $C_0$ after deploying RNs.


4 Solutions 


Affected by users' living and working conditions, the user distribution changes over the
course of a day [18]. The deployment of BSs generally aims at satisfying the capacity
demand in peak hours, while ignoring the resources wasted in the other hours. This has
been reasonable in the past because of the high cost and deployment inconvenience of
traditional BSs. With RNs, these problems matter less because of their low cost and high
flexibility. A whole day can be divided into several time slices, each corresponding to
a unique user distribution. RNs can then be deployed for each time slice to meet the
capacity requirements.

There can be many candidate locations for RN deployment in the area, and we need to
select some of them. Furthermore, as the number of deployed RNs rises, the network
performance may not always increase because of interference, especially when RNs are
located close to each other; strong mutual interference can even decrease the network
capacity. On the other hand, depending on the user distribution, the capacity
contributions of differently located RNs may vary dramatically. Therefore, the RN
deployment problem is a combinatorial problem, and it quickly becomes intractable as the
number of candidate locations grows.

Thus, a greedy algorithm is proposed to select relatively good locations for the RNs.
In this algorithm, an RN is deployed at the location that provides the largest increase
in network capacity. RNs are added iteratively until the capacity requirement is
satisfied.

Algorithm 1. Greedy RN Deployment Algorithm

Initialize R = ∅
while $\sum_{u \in U} C_u(B \cup R) < \lambda \cdot C_0$ do
    $r_{chosen} = \arg\max_{r \in R_{cand}} \left( \sum_{u \in U} C_u(B \cup R \cup r) - \sum_{u \in U} C_u(B \cup R) \right)$
    $R = R \cup r_{chosen}$
    $R_{cand} = R_{cand} - r_{chosen}$
end while


In Algorithm 1, the deployed RN set is denoted by R and the candidate RN set is
denoted by $R_{cand}$. The algorithm is executed for every time slice, which yields an RN
deployment schedule consisting of N sets of RNs, where N is the number of time slices.
Each set of RNs works only during its corresponding time slice and is turned off during
the other time slices. All RNs harvest energy continuously, even while turned off.
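
The greedy selection can be sketched in Python as below. The capacity model is passed in as a function because it depends on the SINR computation for the given user distribution; the function names and the early-exit guard (stop when no candidate improves capacity) are additions for this illustration and are not part of Algorithm 1.

```python
def greedy_deploy(candidates, total_capacity, target):
    """Add the candidate with the largest capacity gain until the target is met.

    candidates     : iterable of candidate RN locations
    total_capacity : function mapping a list of deployed RNs to network capacity
                     (hypothetical interface, supplied by the caller)
    target         : required capacity, i.e. the multiplier times the original capacity
    """
    deployed = []
    remaining = list(candidates)
    current = total_capacity(deployed)
    while current < target and remaining:
        # Evaluate every remaining candidate and keep the best one.
        best = max(remaining, key=lambda r: total_capacity(deployed + [r]))
        best_capacity = total_capacity(deployed + [best])
        if best_capacity <= current:
            break  # guard (not in the paper): interference outweighs any gain
        deployed.append(best)
        remaining.remove(best)
        current = best_capacity
    return deployed
```

Running this once per time slice, each with its own `total_capacity` function, would produce the per-slice deployment schedule described above.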

We then focus on the energy consumption problem. A long turn-off time may waste energy
because of the limited battery capacity. A set of RNs that is supposed to work only in
time slice $t_i$ can also work in $t_j$ as long as it has harvested enough energy. However,
the user distribution in $t_j$ may be different, and there is already a set of RNs
deployed for $t_j$. Two sets of RNs working together may decrease the network capacity
because of interference. So we propose an optimization algorithm that merges the two
sets of RNs while satisfying the capacity requirements of both time slices. In this
algorithm, the RNs deployed in the two time slices are first simply merged; then the two
nearest RNs are replaced by a new RN located at their midpoint, which effectively avoids
redundant deployment and reduces the interference.
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Algorithm 2. Deployment Optimization Algorithm 
Initialize R; ; = R; UR, 


while 2C, PURS AC D> GOUR I A Cdo 


ucU 
minDist = ce 
minDistRNs = Ø 
for all”. € R; y do 
for all”, E€ Riy, #7, do 
if dist(r,,r,) < minDist then 
minDist = dist(r,,7, ) 
minDistRNs =r, Ur, 
Via = GetMidRN(r,,1, ) 
end if 
end for 
end for 
ig T Ra — minDistRNs 


Ry = Riy U Í mia 


R 


end while 


$R_{i,j}$ denotes the merged RN set, which works in both time slices $t_i$ and $t_j$.
dist($r_1$, $r_2$) gives the distance between $r_1$ and $r_2$, and GetMidRN($r_1$, $r_2$) gives
a new RN located at the midpoint of $r_1$ and $r_2$.
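
The core step of Algorithm 2, replacing the two nearest RNs by one RN at their midpoint, can be sketched as follows. Only this single merge step is shown; the surrounding loop and its capacity test are left to the caller. Representing RNs as coordinate tuples and the function name are assumptions for illustration.

```python
import math
from itertools import combinations


def merge_nearest_pair(positions):
    """Replace the two closest RNs by a single RN at their midpoint.

    positions: list of (x, y) tuples for the merged RN set.
    Returns a new list with one RN fewer (unchanged if fewer than two RNs).
    """
    if len(positions) < 2:
        return list(positions)
    # Index pair of the two RNs with the smallest Euclidean distance.
    a, b = min(
        combinations(range(len(positions)), 2),
        key=lambda ij: math.dist(positions[ij[0]], positions[ij[1]]),
    )
    mid = tuple((u + v) / 2.0 for u, v in zip(positions[a], positions[b]))
    return [p for k, p in enumerate(positions) if k not in (a, b)] + [mid]
```

For the set (0, 0), (1, 0), (10, 10), the first two RNs are the nearest pair and are replaced by a single RN at (0.5, 0), leaving two RNs in total.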


5 Simulation Results 


The deployment area is 3 km × 3 km and is served by one BS. RNs need to be deployed
in this area to improve the capacity. More candidate locations give more accurate
results but also bring extra computational complexity. The number of candidate locations
is set to 1600 after a set of experiments. The other parameters used in the simulations
are given in Table 1.

Table 1. Simulation parameters

Parameters                     Values
Target capacity multiple       1.5
Number of time slices          12
Bandwidth                      2 GHz
Static power of BS             168 W
Transmission power of BS       39.8 W
Static power of RN             20 W
Transmission power of RN       1 W
BS to user path loss           128.1 + 37.6 * log10(d (km))
RN to user path loss           37 + 30 * log10(d (m))
Thermal noise                  -174 dBm/Hz

First, we investigate the impact of RN deployment on network capacity. As shown in
Fig. 1, as RNs are deployed one after another, the network capacity increases
significantly at first, then grows more slowly, and eventually decreases because of the
gradually increasing interference. The network capacity can be influenced significantly
by interference among RNs, so the maximum capacity that can be reached by deploying RNs
is limited. Expanding the capacity to the maximum needs a large number of RNs, which may
be inefficient.

The deployed positions of the RNs are shown in Fig. 2. Most of them are located far from
the BS because of the lower SINR there. In addition, their locations are clearly affected
by the user distribution. When the user distribution changes over time, the optimum
deployment positions of the RNs change accordingly.








Fig. 1. Increased capacity by relay nodes (normalized network capacity versus number of
deployed relay nodes)

Fig. 2. Positions of deployed relay nodes


In the following experiments, we focus on the energy consumption problem. If an RN
works only in non-adjacent time slices, it can make full use of its renewable energy. So
we divide the time slices into groups, each consisting of several non-adjacent time
slices. The time slices in the same group share the same RN deployment. If the number of
time slice groups is m, then m sets of RNs are deployed. When the energy harvesting rate
is above 1/m, our split combination solution reaches zero on-grid energy consumption.
For a given energy harvesting rate, more groups of time slices reduce the on-grid energy
significantly. Deployment with 4 groups of time slices saves about 74.5% of the on-grid
energy compared with only one group of time slices (Fig. 3).

Fig. 3. On-grid energy consumption of deployed relay nodes


Energy Harvesting Relay Node Deployment for Network Capacity Expansion 309 


6 Conclusions 


In this paper, we propose a greedy RN deployment algorithm and an optimization
algorithm to improve the network capacity efficiently. The energy harvesting ability of
RNs and user distribution changes are considered in the process of deployment.
Imprudently increasing the number of RNs may decrease the network capacity because of
interference. The proposed deployment algorithm greedily selects relatively good
positions from the candidate locations to deploy RNs until the required capacity is
satisfied. Due to the heuristic nature of the algorithm, the complexity of the
deployment problem is significantly reduced. The optimization algorithm then merges
different sets of RNs according to their energy harvesting ability to make full use of
renewable energy. Zero on-grid energy consumption can be achieved by this algorithm. The
simulation results show that the capacity demand can be satisfied efficiently and the
on-grid energy consumption can be significantly reduced by the proposed algorithms.

Abstract. The kernel fuzzy K-means (KFKM) clustering algorithm is widely used
to manage the fingerprint database of WiFi indoor positioning systems and to reduce
the computational complexity of the position matching process. In this paper, we
propose a novel WiFi positioning scheme based on the KFKM algorithm, which
achieves better precision by further optimizing the parameters employed in
KFKM. Our proposed scheme consists of three steps. First, we choose an interval
of reference points (RPs) to build the fingerprint database. Then we decide an
appropriate number of clusters based on the structural characteristics of the
fingerprint database using the sample density method. During clustering, we
optimize the kernel parameter by approximating the actual kernel matrix to a
hypothetical ideal kernel matrix to improve the positioning precision. Simulation
results show that, compared with the existing KFKM algorithm, our proposed
scheme achieves a 23.48% improvement in positioning accuracy.


Keywords: WiFi positioning · KFKM · Parameters optimization · Fingerprint database ·
Clustering


1 Introduction 


The basic idea of the location fingerprinting method for WiFi positioning consists of
two steps. The first step is to build a database that stores pre-recorded received signal
strength (RSS) values, or “fingerprints”, from different WiFi access points (APs) at
different reference points (RPs). The second step is to estimate the target’s location
by matching the real-time RSS with the offline-constructed database (Fig. 1).

The main limitation of the matching process is the high implementation cost when
the database is too large to traverse. Hence, clustering methods such as the K-means and
fuzzy K-means (FKM) algorithms have been proposed to manage the database. However,
non-linear data is beyond the processing power of the K-means method, since its
objective function is non-differentiable. Motivated by this, the authors of [1] came up
with the fuzzy K-means (FKM) method, which aims to solve the non-differentiability of
the objective function; recently, in [2], Saadi et al. used unsupervised K-means
clustering based on two LEDs and achieved great accuracy. In this paper, we propose a
new scheme to strengthen the performance of WiFi positioning. First, we amend the data
acquisition process of the database; second, we use the sample density method to set the
number of clusters to match the characteristics of the database; last, we propose an
ideal kernel matrix to determine the kernel parameter $\sigma$.

Fig. 1. The schematic diagram of database (RP: reference point; AP: access point)


2 WiFi Indoor Positioning System Based on KFKM 


WQNN is an effective matching algorithm in WIFI positioning [3]. Suppose a WIFI 
wireless network with L access points, denoted x { AP, ,AP,,° -AP, }. There are n RP 
in total, denoted by T ala Sadh i = 1,2,---,L. At on-line stage, the RSSI of the 
target, denoted by s = (Sis Say S ay will maid with the fingerprint database by eval- 
uating the Euclidean distance between s and all fingerprints. Choose Q (Q > 2) smallest 
distance and average the Q corresponding positions, the coordinate (X, ĵ) is the estimated 
location of the target’s positions, 1.e. 








1 
Q 
MSs D; +€ 
(X,9) = » a X On Wi) (1) 
i=l y ] 
i=] D, +E 


where $(x_i, y_i)$ is the coordinate of the i-th RP, $D_i$ is the Euclidean distance between s and
the fingerprint of the i-th RP, and $\varepsilon$ is set to 0.003 to prevent division by zero.
To evaluate the precision of a positioning test, the root-mean-square error (RMSE) between
the estimated position and the real position is used. Moreover, the RMSE drawn from a single
positioning test does not represent the performance of an algorithm. Hence, we collect the
RMSE from repeated experiments to evaluate the algorithm comprehensively. In general, the
cumulative probability curve of the RMSE is an intuitive representation for evaluation.
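
The matching step of Eq. (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wqnn_locate`, the toy fingerprint database and the choice Q = 3 are hypothetical; only the value 0.003 for the constant in the denominator is taken from the text.

```python
import numpy as np

def wqnn_locate(s, fingerprints, coords, q=3, eps=0.003):
    """Estimate (x, y) from an on-line RSSI vector s.

    fingerprints: (n, L) array, one RSSI fingerprint per RP.
    coords:       (n, 2) array, the (x, y) coordinate of each RP.
    """
    s = np.asarray(s, dtype=float)
    d = np.linalg.norm(fingerprints - s, axis=1)  # Euclidean distance to every RP
    nearest = np.argsort(d)[:q]                   # indices of the Q smallest distances
    w = 1.0 / (d[nearest] + eps)                  # inverse-distance weights, eps avoids 1/0
    w /= w.sum()                                  # normalise so the weights sum to 1
    return w @ coords[nearest]                    # weighted average of the Q positions

# Hypothetical toy database: 4 RPs on a line, 2 APs.
fingerprints = np.array([[-40.0, -70.0], [-50.0, -60.0], [-60.0, -50.0], [-70.0, -40.0]])
coords = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
estimate = wqnn_locate([-50.0, -60.0], fingerprints, coords, q=3)
```

In this toy case the on-line vector coincides with the fingerprint of the second RP, so its weight dominates and the two equally distant neighbours cancel each other out.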

The Chemistry building at BNU was used to verify that optimizing the database
structure improves localization accuracy. Considering storage capacity and calculation
complexity, we set the RP interval to 2 m and collected the RSS from 5 APs at all RPs (black
points in Fig. 2) to build the database of the on-site test.

Fig. 2. Floor plan of the on-site positioning test


3 Parameters Optimizations for Positioning System 


In this section, we propose a series of parameter optimizations for the WiFi positioning
system, namely the selection of the cluster amount k and of the kernel parameter $\sigma$ of the
kernel function. The resulting algorithm is denoted POKFKM.

Function_1: set the cluster amount k according to the structure of RP.

The amount of clusters k plays an essential role in the performance of the KFKM
clustering algorithm. The main idea of the sample density method [4] is to analyze all RPs in
the database and count the density peaks, i.e. points that stand out among their neighbors
and lie far from other peaks. $\rho_i$ is the density of the i-th RP and $\delta_i$ is the smallest
distance between the i-th RP and another RP with higher density:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij}), \quad i \neq j \tag{2}$$


Figure 3 shows the $\delta$-$\rho$ plot of the on-site test database. There are 7 peaks in the
upper right corner of the figure. Thus, k = 7 is taken as the optimal cluster amount.
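
The quantities in Eq. (2) can be computed with a few lines of code. The sketch below makes several assumptions that are not stated in the text: $\chi(x)$ is taken to be 1 for x < 0 and 0 otherwise (as in the standard density-peaks method), the densest point receives its largest distance as $\delta$, and the cutoff `d_c`, the peak thresholds and the toy points are all hypothetical.

```python
import numpy as np

def density_peaks(points, d_c):
    """Return (rho, delta) for every point, following Eq. (2)."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)  # pairwise distances
    rho = (d < d_c).sum(axis=1) - 1        # neighbours closer than d_c, self excluded
    delta = np.empty(len(pts))
    for i in range(len(pts)):
        higher = rho > rho[i]              # points with strictly higher density
        # Assumed convention: the densest point has no denser neighbour,
        # so it gets its largest distance to any other point.
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta

# Hypothetical toy data: two small groups of points, centred at (0, 0) and (5, 5).
points = [(0, 0), (0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1),
          (5, 5), (5.1, 5), (4.9, 5), (5, 5.1)]
rho, delta = density_peaks(points, d_c=0.12)

# Peaks combine high density with a large distance to any denser point.
peaks = np.flatnonzero((delta > 1.0) & (rho >= 2))
k = len(peaks)   # number of clusters suggested by the delta-rho analysis
```

Reading the peaks off a $\delta$-$\rho$ plot, as done for Fig. 3, corresponds to the thresholding step at the end; the thresholds used here are only illustrative.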


Fig. 3. Finding density peaks to determine the value of k (sample density $\delta$-$\rho$ plot of the on-site localization test)
Function_2: set the optimal kernel parameter $\sigma$ to get better clustering result.
The Gaussian kernel function is:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \tag{3}$$
Algorithm: WiFi positioning assisted by POKFKM clustering
Function WiFi_positioning(s)
    while off-line stage() do
        set AP amount and RP interval for fingerprint collection;
        while RP++
            collect fingerprints x_i = [x_i1, x_i2, ..., x_iL] into database F
        end while; end while
    function_1: get the cluster amount k according to the structure of RP
    function_2: set the optimal kernel parameter σ to get better clustering result
    function_3: KFKM clustering
        repeat
            J(U, o) = Σ_{i=1..n} Σ_{j=1..k} u_ij^m d²(x_i, o_j)
            u_ij = [1 / (K(x_i, x_i) − 2K(x_i, o_j) + K(o_j, o_j))]^(1/(m−1))
                   / Σ_{j=1..k} [1 / (K(x_i, x_i) − 2K(x_i, o_j) + K(o_j, o_j))]^(1/(m−1)),
                   i = 1, 2, ..., n; j = 1, 2, ..., k
        until |J(U, o)^t − J(U, o)^(t−1)| < ε
    end function_3
    while new_arrival(s)
        WQNN(s); // traverse the clusters and match with RP
    end while
end function


Here $\|\cdot\|$ denotes the Euclidean norm of a vector. The empirical value of $\sigma$ is
usually given by the reciprocal of the sample hypersphere radius [5] or found by enumeration;
neither is optimal. Instead, the optimal $\sigma$ should make the actual kernel matrix
approximate the ideal kernel matrix [6]. The K-means algorithm first divides the database into
k clusters and yields a row vector L with n dimensions. The element $L_i = k'$, $1 \le k' \le k$,
$i = 1, 2, \ldots, n$, gives the cluster to which the i-th sample belongs.
To make the samples within the same cluster as similar as possible
and samples from different clusters as dissimilar as possible, the ideal kernel matrix should
meet the following condition:

$$k'_{ij} = \begin{cases} 1, & L_i = L_j \\ 0, & L_i \neq L_j \end{cases} \tag{4}$$


The kernel parameter that makes the kernel matrix approximate the ideal one is optimal.
The similarity can be quantified by the difference between $k_{ij}$ and $k'_{ij}$:

$$E(k, k') = \sum_{i=1}^{n} \sum_{j=1}^{n} (k_{ij} - k'_{ij})^2 \tag{5}$$

The solution to $\partial E(k, k') / \partial \sigma = 0$ minimizes $E(k, k')$ and is taken as the optimal $\sigma$.
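
The selection of $\sigma$ through Eqs. (4) and (5) can be illustrated numerically. The sketch below is not the authors' procedure: instead of solving the stationarity condition analytically, it evaluates E on a grid of candidate values and keeps the smallest, and it assumes the Gaussian kernel has the form $\exp(-\|x-y\|^2 / (2\sigma^2))$. The function names, the toy samples and the cluster labels (which would come from K-means in the paper) are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, sigma):
    """Actual kernel matrix k_ij (assumed form exp(-||x_i - x_j||^2 / (2 sigma^2)))."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def ideal_kernel_matrix(labels):
    """Ideal kernel matrix of Eq. (4): 1 for the same cluster, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def kernel_error(x, labels, sigma):
    """E(k, k') of Eq. (5): squared difference between actual and ideal matrices."""
    diff = gaussian_kernel_matrix(x, sigma) - ideal_kernel_matrix(labels)
    return float((diff ** 2).sum())

def best_sigma(x, labels, candidates):
    """Grid search standing in for solving dE/dsigma = 0."""
    errors = [kernel_error(x, labels, s) for s in candidates]
    return float(candidates[int(np.argmin(errors))])

# Hypothetical samples with labels as K-means might return them.
x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = [0, 0, 0, 1, 1, 1]
sigma_best = best_sigma(x, labels, np.logspace(-2, 2, 41))
```

A very small $\sigma$ drives within-cluster kernel values toward 0 and a very large one drives between-cluster values toward 1, so the minimum of E lies between the within-cluster spread and the between-cluster distance.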


4 On-site Indoor Experiment 


To verify the performance of the POKFKM algorithm, we conduct comparison tests
between (1) KFKM clustering and (2) POKFKM clustering. The clustering results are shown
in Fig. 4. Cluster centers are denoted by crosses.





Fig. 4. Clustering results of the original and optimized clustering methods: (1) KFKM clustering; (2) POKFKM clustering


After POKFKM clustering, instead of matching against all reference points in the database,
the matching process can be completed by comparing the real-time RSS with the 7 cluster
centers and with at most 9 RPs in the nearest cluster, i.e. at most 16 comparisons in total.
Therefore, the computational load is reduced by 74.6% relative to matching against every RP
in the database.

After initialization, the RSS of the test position is first assigned to the nearest cluster
center, and the RPs within that cluster are then traversed using WQNN; the estimated position
is the result of these two steps. We also use cumulative probability curves of the RMSE over
70 on-site tests to illustrate the positioning accuracy of the clustering results above.
The results are illustrated in Fig. 5.
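
A cumulative probability curve of the kind used for this comparison is simply the fraction of tests whose error does not exceed each threshold. The following sketch shows one way to compute it; the function name and the error values are made up for illustration and are not the paper's measurements.

```python
import numpy as np

def cumulative_probability(errors, thresholds):
    """Fraction of tests with error <= each threshold (empirical CDF)."""
    errors = np.sort(np.asarray(errors, dtype=float))
    # side="right" counts the errors that are less than or equal to the threshold
    return np.searchsorted(errors, thresholds, side="right") / errors.size

# Hypothetical RMSE values (in metres) from five positioning tests.
rmse_values = [1.0, 2.0, 2.5, 3.0, 4.0]
cdf = cumulative_probability(rmse_values, [2.0, 3.0])
```

Plotting `cdf` against a dense range of thresholds for each algorithm gives curves that can be compared directly: the curve that rises faster reaches a given error bound with higher probability.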


Fig. 5. Positioning errors of the KFKM and POKFKM algorithms (RMSE cumulative probability curves)


The results show that POKFKM performs better than KFKM clustering. The
average RMSE of POKFKM is 23.48% lower than that of KFKM, which indicates that the
POKFKM method has better positioning precision than the KFKM algorithm. Meanwhile,
POKFKM has a smaller standard deviation of RMSE, which indicates higher robustness.
The KFKM clustering method achieves the expected error with a probability of 56.9%, while
the POKFKM method is more reliable, with a probability of 83.3%. Hence, the POKFKM
clustering method substantially increases the precision and reliability of the WiFi
positioning system.


5 Conclusion 


In this paper, three optimizations for a WiFi indoor positioning system were proposed to
enhance WiFi positioning. The results show that optimizing the kernel function and the
cluster amount performs better than the original KFKM algorithm. During the on-site
positioning experiments in the Chemistry building at BNU, a reasonable deployment of RPs
and APs made the database both compact and complete. The POKFKM algorithm proved to be
more reliable and robust and substantially increased the efficiency of WiFi indoor
positioning.
Abstract. In this paper, the cognitive radio sensor node can harvest energy
from the radio frequency signal transmitted by the primary user. A time
switching protocol is used to divide the cognitive users' time into three phases:
spectrum sensing, energy harvesting and data transmission. The selection of the
optimal energy harvesting time is therefore the problem to be solved. We consider
a non-cooperative game model in which cognitive users are regarded as selfish
players aiming to maximize their own energy efficiency. We then prove the
existence and uniqueness of the Nash Equilibrium. A distributed best response
algorithm is used to obtain the Nash Equilibrium. The simulation results show
that this algorithm converges to the same equilibrium from different initial
values. Finally, we analyze the influence of various system parameters on the
Nash Equilibrium and the energy efficiency.


Keywords: Energy harvesting · Energy efficiency

1 Introduction


In recent years, with the increase of network energy consumption and the development
of green communication, energy harvesting has drawn more and more attention.
Energy harvesting is a technology that keeps a network self-sustaining and prolongs its
lifetime by harvesting ambient energy (such as wind energy, solar energy, heat, etc.).
Moreover, sensor networks capable of harvesting energy from ambient radio-frequency (RF)
signals have been studied extensively. An energy harvesting wireless sensor network
(EHWSN) can greatly prolong the lifetime of the sensor nodes, which lays a foundation for
the development of emerging technologies such as big data and the Internet of Things
(IoT) [1]. However, it is worth noting that sensor nodes can only work on unlicensed
spectrum bands, which are becoming more and more crowded; this limits the development
of sensor networks.

Meanwhile, surveys show that the spectrum efficiency of licensed spectrum is relatively
low. The energy harvesting cognitive radio sensor network (EHCRSN) has therefore been
put forward. In an EHCRSN, the secondary user is a sensor node equipped with the ability
of spectrum sensing, which can also harvest RF energy. However, energy harvesting and
data transmission of the sensor user cannot be performed at the same time. Therefore,
two practical receiver methods have been proposed in [2], namely the time switching (TS)
protocol and the power splitting (PS) protocol. In the former, the receiver harvests
energy in the first period and then transmits information in the remaining time. In the PS
protocol, the receiver splits the received signal power into two streams: energy
harvesting and information transmission.

Several papers in recent years have focused on the TS and PS protocols. In [3], the
authors study a decode-and-forward energy harvesting relay cognitive network using the TS
protocol. However, the scenario is too specialized. The relay selection problem using the
TS and PS protocols in a cooperative cognitive radio network is studied in [4]. A
distributed power splitting architecture is investigated in [5] with the purpose of
achieving the optimal sum-rate through selecting the power splitting ratios. RF energy
harvesting in a DF relay network is studied in [6], where the TS protocol, the PS protocol
and a hybrid TS-PS protocol are applied to compare and analyze outage probabilities and
throughput.

Regarding cognitive radio sensor networks, the authors of [7] study power allocation in
cognitive sensor networks. However, neither energy harvesting nor the TS protocol is
considered there. In [8], a framework is developed and a low-complexity algorithm is
proposed for EHCRSN. The optimal mode selection problem to maximize the throughput of
the sensor node in EHCRSN is studied in [9]. To the best of our knowledge, however, few
works have investigated applying the TS protocol in EHCRSN and solving the resulting
problem with the best response algorithm. The main contributions of this paper are
summarized as follows:


1. We propose a new cognitive radio sensor network architecture and address the
question of how to select the energy harvesting time to balance energy consumption
and energy replenishment in EHCRSN.

2. We apply the time switching relaying (TSR) protocol to the cognitive users, which
divides the time T into three parts: spectrum sensing phase, energy harvesting phase
and data transmission phase. We then analyze the energy efficiency maximization
problem through selecting the optimal energy harvesting coefficient.

3. We formulate this problem as a non-cooperative game. We then use the best
response algorithm to reach the Nash Equilibrium and demonstrate the good performance
of our algorithm. Finally, we show how variations of different parameters influence
the energy harvesting time and the system utility.


The remainder of the paper is structured as follows. Section 2 describes the system
model of the primary user and the cognitive users. In Sect. 3 we formulate the maximization
problem as a non-cooperative game and prove the existence and uniqueness of the Nash
Equilibrium. The best response algorithm is introduced in Sect. 4. Finally, Sect. 5
presents the numerical results and Sect. 6 concludes this paper.

2 System Model


We consider a cognitive radio sensor network with one primary user (PU) and N
cognitive users (CUs), as illustrated in Fig. 1. The cognitive users, also called secondary
users (SUs), are sensor nodes equipped with the ability of spectrum sensing. Assume
that each cognitive radio sensor user can harvest energy and is equipped with an
energy storage device. Denote by $\mathcal{N} = \{1, 2, \ldots, N\}$ the set of CUs.





Fig. 1. The system model of the primary user and cognitive users (CU: cognitive user; PU: primary user)


A. Primary user model 


The primary user works in the licensed band and shares it with the cognitive users.
The PU has two modes: data transmission mode and idle mode, as shown in Fig. 2(a).
Because the PU's state always alternates between transmission and idle, we consider the
sum of a transmission state duration and its adjacent idle state duration as an analytical
interval T. We assume that the PU does not stay in one state for a long time, which
matches the actual situation.


Fig. 2. The stages of the primary user and cognitive users in time T: (a) two stages of the primary user (labelled Tr and Ti in the figure); (b) three stages of the i-th cognitive user: spectrum sensing, $(1-\alpha)T$; energy harvesting, $\beta_i \alpha T$; data transmission, $(1-\beta_i)\alpha T$


The PU is regarded as an RF energy provider when it is in data transmission mode, and the
CUs can harvest RF energy from the PU in this phase.

B. Cognitive user model

In Fig. 2(b), the cognitive user has three states during the time T: spectrum sensing
phase, energy harvesting phase and data transmission phase. Here $\alpha$ is the time
coefficient that divides T into a sensing phase of duration $(1-\alpha)T$ and a non-sensing
phase of duration $\alpha T$, and $\beta_i$ is the energy harvesting time coefficient, i.e. the
fraction of the non-sensing phase that CU i spends harvesting energy. In the first duration
$(1-\alpha)T$, the CUs sense the PU's spectrum usage and gather the sensing results
cooperatively. In this paper, we assume that the coefficient $\alpha$ is fixed and that the
sensing power consumption $P_{ses}$ is a constant value.

The CU can then harvest energy from the PU when the PU is in transmission
mode. We consider that the CU does not have its own power supply and depends entirely
on RF energy harvesting. All CUs work in half-duplex mode, so they cannot transmit data
and harvest energy at the same time. During the harvesting duration $\beta_i \alpha T$, the
signal received by CU i can be expressed as

$$y_i = \sqrt{P_s}\, g_i x_1 + n_i \tag{1}$$

where $g_i$ is the channel gain from the PU to CU i, $P_s$ and $x_1$ are the fixed transmit
power and the transmitted information of the PU, and $n_i$ is additive Gaussian noise
with zero mean and variance $\sigma^2$. For simplicity, we ignore the influence of the
Gaussian noise. The energy harvested by CU i during the duration $\beta_i \alpha T$ can be
expressed as

$$Q_i = \eta \alpha \beta_i T P_s |g_i|^2 \tag{2}$$

where $\eta$ is the energy harvesting efficiency. We assume that the energy $Q_i$ is used
entirely for the subsequent data transmission and spectrum sensing. The transmission
power of CU i is assumed to be small enough that the harvested energy suffices.

In the remaining time $(1-\beta_i)\alpha T$, CU i transmits data to the other CUs or to the
PU using the harvested energy. Since the energy consumed during the sensing period
is $P_{ses}(1-\alpha)T$, the transmission power of CU i can be calculated as

$$p_i = \frac{Q_i - P_{ses}(1-\alpha)T}{(1-\beta_i)\alpha T} = \frac{1}{1-\beta_i}\left(\eta \beta_i P_s |g_i|^2 - \frac{1-\alpha}{\alpha} P_{ses}\right) \tag{3}$$




C. Maximization of energy efficiency 


We assume that the cognitive users are deployed around the PU and that the
interference produced by other cognitive users is far greater than that from the PU. Thus
the SINR of a particular cognitive user is determined by the transmission power of the
other CUs [10]. We can write the signal-to-interference-plus-noise ratio (SINR) of the
i-th CU as

$$\mathrm{SINR}_i = \frac{h_{ii} p_i}{\sum_{j=1, j \neq i}^{N} h_{ij} p_j + \sigma^2} \tag{4}$$

where $h_{ij}$ denotes the channel gain between CU i and CU j, and $\sigma^2$ is the power of
the Gaussian noise introduced by the environment. From Eqs. (3) and (4), $\mathrm{SINR}_i$
depends on the energy harvesting coefficient vector $\beta = [\beta_1, \beta_2, \ldots, \beta_N]$.

To ensure that the data transmission of the PU is not affected, the interference
power from the CUs to the PU must be limited to a tolerable range. Assume that the largest
transmission power of a CU is $P_{max}$. Conversely, if a CU is to transmit information
normally, its transmission power must be larger than a fixed value $P_{min}$.
Therefore, the range of the transmission power of CU i is

$$P_{min} \le p_i \le P_{max} \tag{5}$$

where $p_i$ is defined in Eq. (3). Solving Eq. (3) for $\beta_i$ gives the corresponding range

$$\frac{\alpha P_{min} + P_{ses}(1-\alpha)}{\alpha (P_{min} + \eta P_s |g_i|^2)} \le \beta_i \le \frac{\alpha P_{max} + P_{ses}(1-\alpha)}{\alpha (P_{max} + \eta P_s |g_i|^2)} \tag{6}$$

To simplify the description, we denote the minimum and maximum values of
$\beta_i$ by $\beta_{min}$ and $\beta_{max}$, respectively.
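
The bounds in Eq. (6) follow from inverting Eq. (3), which can be checked numerically. In the sketch below the function names and the channel gain g = 0.5 are assumptions; the other values are those quoted in the numerical section, in mW.

```python
ALPHA, ETA = 0.2, 0.5
P_S, P_SES = 20.0, 0.0002      # mW (0.2 uW = 0.0002 mW)
P_MIN, P_MAX = 0.1, 5.0        # mW
G = 0.5                        # hypothetical PU-to-CU channel gain

def power_from_beta(beta):
    """Eq. (3): transmission power for a given harvesting coefficient."""
    return (ETA * ALPHA * beta * P_S * G ** 2 - P_SES * (1 - ALPHA)) / ((1 - beta) * ALPHA)

def beta_from_power(p):
    """Eq. (6): harvesting coefficient that yields transmission power p."""
    return (ALPHA * p + P_SES * (1 - ALPHA)) / (ALPHA * (p + ETA * P_S * G ** 2))

beta_min = beta_from_power(P_MIN)
beta_max = beta_from_power(P_MAX)

# Round trip: the bounds map back to the power limits of Eq. (5).
recovered_min = power_from_beta(beta_min)
recovered_max = power_from_beta(beta_max)
```

With these values the feasible interval for the harvesting coefficient is roughly 0.04 to 0.67, so both bounds lie strictly inside (0, 1).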

By the Shannon formula, the achievable transmission rate of CU i is
determined by $\mathrm{SINR}_i$ given in Eq. (4). The rate of CU i is therefore

$$r_i = B \log(1 + \mathrm{SINR}_i) \tag{7}$$

where B is the transmission spectrum bandwidth shared by the PU and CU i. We define
energy efficiency as the achievable data rate per unit power consumption [11], which
can be formulated as

$$u_i = \frac{r_i}{P_{total}} = \frac{B \log(1 + \mathrm{SINR}_i)}{P_{ses} + p_i} \tag{8}$$
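
Equations (4), (7) and (8) can be evaluated for all users at once with a gain matrix. The sketch below is an illustration under assumptions: the logarithm is taken to base 2, the bandwidth is normalised to 1, and the two-user gain matrix and power vector are made up.

```python
import numpy as np

def sinr(p, h, noise):
    """Eq. (4): h[i, j] is the gain between CU i and CU j, p the power vector."""
    p = np.asarray(p, dtype=float)
    h = np.asarray(h, dtype=float)
    signal = np.diag(h) * p              # h_ii * p_i
    interference = h @ p - signal        # sum over j != i of h_ij * p_j
    return signal / (interference + noise)

def energy_efficiency(p, h, noise, bandwidth, p_ses):
    """Eqs. (7)-(8): rate per unit of consumed power (log base 2 assumed)."""
    p = np.asarray(p, dtype=float)
    rate = bandwidth * np.log2(1.0 + sinr(p, h, noise))
    return rate / (p_ses + p)

# Hypothetical two-user example, powers in mW.
h = [[1.0, 0.1],
     [0.1, 1.0]]
p = [1.0, 2.0]
sinr_values = sinr(p, h, noise=1.0)
utilities = energy_efficiency(p, h, noise=1.0, bandwidth=1.0, p_ses=0.0002)
```

Each user's utility depends on the other users only through the interference term, which is what couples the individual maximization problems in the game formulation.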


In this paper, we formulate the time selection problem with the goal of maximizing the
energy efficiency, which is a globally optimal network-wide performance problem. In
practice, this is a tradeoff in time allocation between the energy harvesting mode and the
data transmission mode. We need to solve the following network utility maximization problem:

$$\begin{aligned} \max_{\beta} \quad & \sum_{i=1}^{N} u_i \\ \text{s.t.} \quad & \frac{\alpha P_{min} + P_{ses}(1-\alpha)}{\alpha(P_{min} + \eta P_s|g_i|^2)} \le \beta_i \le \frac{\alpha P_{max} + P_{ses}(1-\alpha)}{\alpha(P_{max} + \eta P_s|g_i|^2)}, \\ & 0 < \beta_i < 1 \end{aligned} \tag{9}$$
3 Non-cooperative Game and Nash Equilibrium

D. Non-cooperative Game

Game theory studies situations in which players contend with each other while
observing a set of rules [12]. In this paper, because the interference depends on the
transmission power of all CUs, the maximization problems of the individual cognitive users
are coupled. We assume that all CUs are selfish and rational, which means that they are only
interested in maximizing their own energy efficiency, and that they share common knowledge.
Thus, problem (9) can be converted as follows:


$$\begin{aligned} \max_{\beta_i} \quad & u_i(\beta_i, \beta_{-i}) \\ \text{s.t.} \quad & \beta_i \in S_i \end{aligned} \tag{10}$$

where $\beta_{-i} = [\beta_1, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_N]$ stands for the energy harvesting
coefficients of all CUs except the i-th one, and $S_i$ is the feasible set of the energy
harvesting coefficient of the i-th CU. Next, we model the problem as a non-cooperative game
with N players, which can be expressed as follows:

$$g = \left(\mathcal{N}, \{S_i\}_{i \in \mathcal{N}}, \{u_i(\beta_i, \beta_{-i})\}_{i \in \mathcal{N}}\right) \tag{11}$$

• Player set $\mathcal{N} = \{1, 2, \ldots, i, \ldots, N\}$: the set of N cognitive radio sensor users.

• Actions $\{S_i\}$: each player i selects its energy harvesting coefficient $\beta_i \in S_i$ to
maximize its energy efficiency. $S = \prod_{i=1}^{N} S_i$ is the strategy space of the N players,
and a strategy vector $s = \{s_1, s_2, \ldots, s_N\}$ is an element of S.

• Utility function $\{u_i(\beta_i, \beta_{-i})\}$: the utility of player i.


E. Existence of the Nash Equilibrium 


Definition 1. A strategy vector $s^*$ is a Nash Equilibrium (NE) of a non-cooperative game
if and only if it satisfies

$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i, s_{-i}^*), \quad \forall i \in \mathcal{N},\ \forall s_i \in S_i$$

That is to say, an NE is a strategy vector with the property that no player can obtain a
better utility by unilaterally changing its strategy. The following theorem tells us how to
determine whether an NE exists for a particular game.

Theorem 1. In the game $g = (\mathcal{N}, \{S_i\}, \{u_i(\beta_i, \beta_{-i})\})$, an NE exists if the number of
players is finite, the strategy sets $S_i$ are convex, closed and bounded, and the utility
functions $u_i(\beta_i, \beta_{-i})$ are continuous and quasi-concave for all $i \in \mathcal{N}$.

Proposition 1. The formulated energy harvesting time selection game g has at least
one NE.
Proof. First, there are N players in the game in total, which
meets the first condition in Theorem 1. Second, because the strategy set of each CU i is
a line segment, it is straightforward to see that $S_i$ is convex, closed and
bounded. In addition, the utility function $u_i(\beta)$ is continuous in $\beta$. Thus, we only
need to focus on the quasi-concavity of $u_i(\beta)$.

If $-f(x)$ is quasi-convex, then $f(x)$ is quasi-concave. So once we have proved that
$-u_i(\beta)$ is quasi-convex, it follows that this game possesses at least one NE.


Theorem 2. A continuous function $f: \mathbb{R} \to \mathbb{R}$ is quasi-convex if and only if at least one
of the following conditions is satisfied.

• f is nondecreasing.

• f is nonincreasing.

• There exists a point $c \in \mathrm{dom} f$ such that f is nonincreasing for $t \le c$ (and
$t \in \mathrm{dom} f$) and nondecreasing for $t \ge c$ (and $t \in \mathrm{dom} f$).


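We now take the first-order derivative of the negative utility function

$$f_i(\beta) = -u_i(\beta) = -\frac{B \log(1 + \mathrm{SINR}_i)}{P_{ses} + p_i} \tag{12}$$

Through algebraic simplification, the derivative can be expressed as

$$\begin{aligned} \frac{\partial f_i(\beta)}{\partial \beta_i} &= -\frac{\partial u_i(\beta)}{\partial \beta_i} = -\frac{\partial p_i}{\partial \beta_i} \frac{\partial u_i}{\partial p_i} \\ &= -\frac{1}{(1-\beta_i)^2}\left(\eta P_s |g_i|^2 - \frac{1-\alpha}{\alpha} P_{ses}\right) \cdot B \cdot \log(e) \cdot \frac{\dfrac{h_{ii}(p_i + P_{ses})}{\sigma^2 + \sum_{j=1}^{N} h_{ij} p_j} - \ln\left(1 + \dfrac{h_{ii} p_i}{\sigma^2 + \sum_{j=1, j \neq i}^{N} h_{ij} p_j}\right)}{(p_i + P_{ses})^2} \end{aligned} \tag{13}$$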

In practice, the primary user's transmission power is far larger than the
sensing power of CU i. So the first factor of Eq. (13) satisfies
$\eta P_s |g_i|^2 - \frac{1-\alpha}{\alpha} P_{ses} > 0$, and the derivative $\partial p_i / \partial \beta_i > 0$ always holds,
i.e. the transmission power increases with $\beta_i$. When $\beta_i$ takes the value at which
$p_i(\beta_i) = 0$, which by Eq. (3) is $\beta_i = \frac{(1-\alpha) P_{ses}}{\eta \alpha P_s |g_i|^2}$, we have
$\partial f_i / \partial \beta_i < 0$. Further analysis shows that there is only one value of $\beta_i$ that
satisfies $\partial f_i / \partial \beta_i = 0$, so $f_i(\beta)$ first decreases and then increases as $\beta_i$
increases. Therefore, we can conclude that $u_i(\beta)$ is quasi-concave and that there exists
at least one NE in the game.


F. Uniqueness for the NE 


Definition 2. A game g is said to be supermodular if the following condition is
satisfied:

$$\frac{\partial^2 u_i}{\partial s_i \partial s_j} \ge 0, \quad \forall i \neq j \in \mathcal{N} \tag{14}$$


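We use the result that an NE is unique if the game is a supermodular game.
Therefore, we only need to analyze the non-negativity of the second-order mixed partial
derivative of the utility function, which can be expressed as

$$\begin{aligned} \frac{\partial^2 u_i(\beta)}{\partial \beta_i \partial \beta_j} = {} & B \log(e) \times \frac{1}{(1-\beta_i)^2 (1-\beta_j)^2} \left(\eta P_s |g_i|^2 - \frac{1-\alpha}{\alpha} P_{ses}\right)\left(\eta P_s |g_j|^2 - \frac{1-\alpha}{\alpha} P_{ses}\right) \\ & \times \frac{h_{ii} h_{ij} \left[p_i^2 h_{ii} - P_{ses}\left(\sigma^2 + \sum_{k=1, k \neq i}^{N} h_{ik} p_k\right)\right]}{(p_i + P_{ses})^2 \left(\sigma^2 + \sum_{k=1, k \neq i}^{N} h_{ik} p_k\right) \left(\sigma^2 + \sum_{k=1}^{N} h_{ik} p_k\right)^2} \end{aligned} \tag{15}$$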


According to this analysis, whether the game is supermodular depends on the sign of the
term $p_i^2 h_{ii} - P_{ses}(\sigma^2 + \sum_{j=1, j \neq i}^{N} h_{ij} p_j)$. After simplification, this term is
greater than 0 if and only if

$$\mathrm{SINR}_i > \frac{P_{ses}}{p_i} \tag{16}$$

Because the sensing power is far smaller than the transmission power of the cognitive
users, the ratio of the sensing power to the transmission power is a very small
value and the above inequality holds naturally. Finally, we can conclude that
this game is a supermodular game and that only one NE exists.
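
The simplification that leads to Eq. (16) can be written out step by step. The shorthand $I_i$ for the interference-plus-noise term is introduced here only for this derivation, and $p_i > 0$ is assumed so that dividing by $p_i$ preserves the inequality.

```latex
\begin{aligned}
p_i^{2} h_{ii} - P_{ses} I_i > 0
  &\iff h_{ii}\, p_i^{2} > P_{ses}\, I_i \\
  &\iff \frac{h_{ii}\, p_i}{I_i} > \frac{P_{ses}}{p_i} \\
  &\iff \mathrm{SINR}_i > \frac{P_{ses}}{p_i},
  \qquad I_i = \sigma^{2} + \sum_{j=1,\, j \neq i}^{N} h_{ij}\, p_j .
\end{aligned}
```

The last line uses the definition of the SINR in Eq. (4), whose denominator is exactly $I_i$.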


4 Best Response Algorithm 


In this section, we introduce a distributed best response algorithm which can
reach the unique NE from any initial state. The aim of this algorithm is to maximize
the energy utility by letting all users change their energy harvesting coefficients $\beta_i$
one by one until a suitable termination condition is satisfied.

First, at t = 0, each player is assigned an initial value $\beta_i(0)$ from its domain and
its utility value $u_i(0)$ is computed. Then each player updates its energy harvesting
time coefficient $\beta_i(t)$ according to Eq. (17), in a fixed sequence, while the coefficients
$\beta_{-i}(t)$ of the other players remain unchanged.

$$\beta_i^{*} = \arg\max_{\beta_i \in S_i} u_i(\beta_i, \beta_{-i}) \tag{17}$$


Finally, the iterative process stops when the following termination condition is
satisfied:

$$\sum_{i=1}^{N} |\beta_i(t+1) - \beta_i(t)| < \eta \tag{18}$$

where $\eta$ here is an extremely small constant. The algorithm can be programmed as
follows.


Algorithm 1. Best Response Algorithm

Initialization: t = 0, total-error = 100000, β_i(0) ∈ S_i, i = 1, 2, ..., N

while total-error > η do
    for i = 1:N
        find the optimal β_i*(t) and set β_i(t+1) = β_i*(t)
    end for
    total-error = Σ_{i=1..N} |β_i(t+1) − β_i(t)|
    t ← t + 1
end while
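
The loop of Algorithm 1 can be sketched in code as follows. This is an illustration under assumptions rather than the authors' implementation: the arg max of Eq. (17) is replaced by a grid search over each player's feasible interval from Eq. (6), the logarithm is taken to base 2, the bandwidth is normalised, and the three-user channel gains are made up. Parameter values follow the numerical section, in mW.

```python
import numpy as np

ALPHA, ETA = 0.2, 0.5
P_S, P_SES = 20.0, 0.0002          # mW (0.2 uW = 0.0002 mW)
P_MIN, P_MAX = 0.1, 5.0            # mW
NOISE, BANDWIDTH = 1.0, 1.0        # noise in mW; bandwidth normalised (assumption)

def power(beta, g):
    """Eq. (3): transmission power for harvesting coefficient beta."""
    return (ETA * ALPHA * beta * P_S * g ** 2 - P_SES * (1 - ALPHA)) / ((1 - beta) * ALPHA)

def beta_bound(p, g):
    """Eq. (6): harvesting coefficient that yields transmission power p."""
    return (ALPHA * p + P_SES * (1 - ALPHA)) / (ALPHA * (p + ETA * P_S * g ** 2))

def utility(i, beta, g, h):
    """Eq. (8): energy efficiency of player i for the strategy vector beta."""
    p = power(beta, g)
    interference = h[i] @ p - h[i, i] * p[i]
    rate = BANDWIDTH * np.log2(1.0 + h[i, i] * p[i] / (interference + NOISE))
    return rate / (P_SES + p[i])

def best_response(beta0, g, h, tol=1e-6, grid=501, max_iter=200):
    lo, hi = beta_bound(P_MIN, g), beta_bound(P_MAX, g)
    beta = np.clip(np.asarray(beta0, dtype=float), lo, hi)  # start inside S_i
    for _ in range(max_iter):
        previous = beta.copy()
        for i in range(len(beta)):                  # players update one by one
            candidates = np.linspace(lo[i], hi[i], grid)
            trial = beta.copy()
            values = []
            for c in candidates:                    # grid search stands in for arg max
                trial[i] = c
                values.append(utility(i, trial, g, h))
            beta[i] = candidates[int(np.argmax(values))]
        if np.abs(beta - previous).sum() < tol:     # termination test of Eq. (18)
            break
    return beta

# Hypothetical three-user setting.
g = np.array([0.3, 0.45, 0.6])
h = np.array([[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]])
ne_a = best_response([0.2, 0.2, 0.2], g, h)
ne_b = best_response([0.6, 0.6, 0.6], g, h)
```

Running the loop from two different starting vectors gives a quick check of the convergence behaviour described in the text; the grid resolution limits how precisely the equilibrium is located.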


5 Numerical Results 


In this section, we present numerical results to illustrate the performance of the
algorithm. We consider eight cognitive users, and the channel gains from the PU to the CUs
are drawn from the interval [0.3, 0.6]. The channel gains $h_{ii}$ and $h_{ij}$ between CUs are
drawn independently from the intervals [0.8, 1.2] and [0.05, 1.15], respectively. Recent
research shows that the power obtained by an RF harvester can reach 5 mW at
1000 MHz, and we set the shared spectrum to 1000 MHz. The minimum and maximum
transmission powers of CU i are $P_{min} = 0.1$ mW and $P_{max} = 5$ mW. In the following
examples, unless mentioned otherwise, we set $P_s = 20$ mW, $P_{ses} = 0.2$ μW,
$\alpha = 0.2$, $\eta = 0.5$ and $\sigma^2 = 1$ mW.

Figures 3 and 4 show that the best response algorithm converges to a unique NE
when the initial value of CU1 is set to different values. They show the convergence
process of CU1's energy harvesting coefficient $\beta_1$ and of its utility under different
initial conditions. It can be seen from Fig. 4 that the utility of CU1 rises steadily
until the NE is reached.

Figure 5 depicts the utility comparison between the initial condition and the final
condition at the NE. Each player's final utility is much larger than its utility in the
initial condition, which illustrates the benefit of our algorithm. Next, we analyze the
impact of changing one parameter (e.g. $\eta$, $P_{ses}$ or N) on the NE solution and the
utility value.

Fig. 5. Utility comparison between the initial condition and the final condition


Figure 6 shows the final energy harvesting time coefficient $\beta_i$ and energy efficiency
$u_i$ versus the energy harvesting efficiency $\eta$. It can be observed from Fig. 6(a) that $\beta_i$
decreases as $\eta$ increases, because a larger $\eta$ means that less harvesting time is needed
to achieve the required transmission power. Fig. 6(b) shows that $\eta$ has no effect on the
final energy efficiency.

Fig. 6. The impact of energy harvesting efficiency on (a) energy harvesting time coefficient and
(b) energy efficiency with N = 8, $P_{ses}$ = 0.2 μW and $\alpha$ = 0.2

Figure 7 shows the influence of CU i's sensing power on the energy harvesting time
coefficient and the energy efficiency. The energy harvesting time coefficient $\beta_i$ first
remains unchanged and then increases as the sensing power of CU i grows.
Meanwhile, the energy efficiency first remains unchanged and then decreases. This is
because, when the sensing power is below a certain threshold, it is so small relative to
the transmission power that it has no effect on the NE. Once the sensing power becomes
large enough, however, CU i needs more harvested energy to support it, and the energy
efficiency decreases according to Eq. (8).

Fig. 7. The impact of CU i's sensing power on (a) energy harvesting time coefficient and
(b) energy efficiency with N = 8, $\eta$ = 0.5 and $\alpha$ = 0.2
Finally, we analyze the final total energy efficiency $u_{all} = \sum_{i=1}^{N} u_i$ versus the
number of cognitive users N. Figure 8 shows that the total energy efficiency of the two
curves is almost the same when there are two players, after which the curves follow
opposite trends. The blue curve shows that, as N increases, the total utility after the
game first increases and then decreases. When the number of users is below a certain
value, the interference from other users is relatively small and the total utility
increases with the number of users. With a further increase of N, however, the
interference becomes stronger, which directly decreases the total utility.

Fig. 8. The total energy efficiency versus different numbers of players with $P_{ses}$ = 0.2 μW and
$\eta$ = 0.5 (Color figure online)
6 Conclusion

In this paper, we have studied how to select the optimal energy harvesting time to
maximize energy efficiency using a non-cooperative game. We first proved the existence
and uniqueness of the Nash equilibrium, and then obtained the convergence process toward
the NE with the best response algorithm. The results demonstrate that the best response
algorithm converges to the same equilibrium from different initial values. In addition,
the non-cooperative game yields good utility performance for this energy harvesting time
coefficient selection problem. Finally, we studied the effect of various system parameters
on the Nash equilibrium.
Abstract. Urban traffic congestion is a major problem for urban transportation
management all over the world. However, traditional research focuses only on
the detection and description of urban traffic situations, which is not enough to
improve urban traffic conditions. In this paper, we distinguish two types of
traffic congestion: traffic paralysis and traffic jams. The former is the state in which
traffic is almost stagnant in a large area and on many roads, and it takes a long
time to recover normal traffic flow. In comparison, a traffic jam has a smaller
negative effect on traffic flow and recovers easily. Accordingly, we
propose a traffic paralysis alarm system based on strong associated subnets to warn of
traffic paralysis incidents. The system targets city road networks, mines association
rules between road segments, constructs the strong associated subnets and
detects traffic anomalies with floating car GPS data. We analyze two parameters
of our proposed system with a real dataset generated by over 2000 taxicabs in
Zhuhai and illustrate our system with a simulation experiment on VISSIM.


Keywords: Traffic paralysis · Association rule · Strong associated subnet · Alarm


1 Introduction 


With the rapid development of urbanization and the remarkable improvement of living 
quality in China, both the population and the number of private cars have increased
significantly in cities. It is reported that by the end of 2013 the urban population of China 
accounted for 53.73% of the total population, and urbanization is expected to reach more 
than 60% by 2020. In addition, the number of private cars rose from 18.48 million in 
2005 to 123.39 million in 2014, according to the data from the National Bureau of 
Statistics of China. 

Urbanization means the movement of a large part of the rural population to cities.
Urbanization improves the living standards of many people, promotes economic
development, and is conducive to the development of industry, education,
science and technology, culture, etc. But urbanization also brings with it huge
challenges, such as air pollution, public health, and traffic congestion. Because of its
impact on pollution and public health, traffic congestion is a key issue in successful
urbanization.

With the development of the Intelligent Transportation System (ITS), there has been
a lot of research on traffic congestion [1-3], but most of these studies merely discriminate
or describe traffic situations. These studies are based on a two-valued state: traffic
is either free or congested. But there is a big difference between a traffic jam and traffic
paralysis. We should consider them as two different states when we analyze traffic
situations in the real world.

Figure 1 shows our traffic state transition diagram. Area A shows the traditional
traffic state transition: when a traffic anomaly occurs, such as a rear-end collision, it
causes a traffic jam. Each traffic anomaly is itself random, which is why current
research on urban traffic congestion remains at the level of identifying and
describing traffic situations and finds it difficult to go further.
Therefore, we re-analyze the traffic states, add a paralysis state, and obtain the new
traffic state transition diagram shown in Fig. 1. In the new diagram,
a transition from a traffic jam to traffic paralysis does not always occur; it depends
on the following three factors:


• Spatial-temporal Location: the time and the location at which the traffic jam occurred.
• Road Network Topology: the road network topology of the jammed road segment
  and its surrounding road segments.
• Traffic Flow: the volume of traffic flow on the jammed road segment.


Fig. 1. Traffic state transition diagram


These three factors determine whether a traffic jam can escalate to traffic paralysis, 
and they provide the possibility for us to predict traffic paralysis. 

We propose a traffic paralysis alarm system based on strong associated subnets.
It consists of three modules, namely, data preprocessing, strong associated subnet
mining, and traffic anomaly detection. The proposed system mainly depends on the
strong association between the road segments in a strong associated subnet.


2 Related Work


The proposed system is mainly related to three issues: map-matching, association rule 
mining, and traffic anomaly detection. 


2.1 Map-Matching Problem 


Irregularities in GPS positioning prevent the direct use of GPS data. Map-matching
algorithms correct the deviation of GPS data and match GPS points to road
segments in order to recover the actual trajectories of vehicles.

Typical geometric algorithms include point-point, point-line, and line-line algo- 
rithms. A point-point algorithm matches GPS data to the nearest point on surrounding 
road segments. The accuracy of such algorithms is unpredictable. 

Topological algorithms [4, 5] consider the connectivity of roads, adjacency
relationships, and other attributes (such as one-way roads) when matching. Paper [5] gives a
C-measure based algorithm, which uses distance, angle, and connectivity to calculate
match reliability.

Advanced algorithms [6–9] generally use more complex computational models, have
greater accuracy, and can meet some special needs. The ST-Matching algorithm [6],
proposed for GPS trajectories with a low sampling rate (e.g., one point every 2–5 min),
takes into account the geometric and topological structures of road networks and
the temporal/speed constraints of the trajectories. Paper [7] proposes a Hidden Markov
Model-based algorithm which accounts for measurement noise and the structure of the
road network. In our paper, we use the algorithm proposed in [7] to correct GPS data.


2.2 Association Rule Mining 


Association rules are one of the most active and widely used knowledge types in the field
of data mining, and many association rule mining algorithms now exist.

Apriori is the first typical rule mining algorithm, and many later algorithms
are based on its idea. Paper [10] points out that the early iterations
of Apriori-based algorithms are the dominating factor for the overall mining
performance, and proposes an effective hash-based algorithm that generates
smaller candidate sets. FP-Growth [11] does not generate candidates and
requires only two scans of the transaction database. It compresses the transaction
information into an FP-tree in which the item-sets are ordered by descending support.

Over years of research and application, many association rule mining algorithms have
been developed for different scenarios. For example, paper [12] presents an algorithm to
mine rules in databases that are partitioned among many computers distributed
over a wide area. Paper [13] summarizes the existing algorithms that address the issue
of mining association rules from data streams.


2.3 Traffic Anomaly Detection


A traffic anomaly is anything blocking the normal traffic flow, such as a broken-down
vehicle, a rear-end collision, or another accident. There is research on detecting traffic
anomalies [1, 3]. A traffic-flow-pattern based algorithm which focuses on traffic volume
and velocity on roads is proposed in [1]. An anomaly detected by the algorithm is
represented by a subgraph of a road network where drivers' routing behaviors significantly
differ from their typical patterns. An improved nonparametric regression algorithm
(INPRA) is presented in [3]. A standard deviation algorithm is used to calculate and check
the standard deviation between the traffic data predicted by INPRA and the current
traffic data: if the standard deviation is larger than a predefined threshold, the algorithm
sends out a signal for a possible incident.


3 Architecture and Design 


3.1 System Overview 


In this section, we present the architecture and design of the proposed system. The
proposed system consists of three modules, shown in Fig. 2: (A) pre-processing module;
(B) strong associated subnet mining module; (C) traffic anomaly detection module.


[Block diagram. Recoverable labels: GPS Point Trajectory; Road Segments Trajectory Transformation; Road Segments Trajectory Database; Strong Associated Subnet Mining; Association Rule; Spatial-temporal Position Generation; Traffic Anomaly Detection.]

Fig. 2. Architecture of traffic paralysis alarm system


Module A pre-processes the digital map data, the large text files, and the GPS data. First,
a digital map database is constructed by importing digital map data into a spatial database
and splitting each road into road segments. Second, the collected data must be
preprocessed because each text file storing one day of taxi GPS data for Zhuhai city is
larger than 11 GB, and its records are unordered and redundant. Therefore, we remove
redundant records and sort the rest by taxi ID and timestamp. Finally, we deploy the
map-matching server and import the observed GPS point data and the corrected GPS point
data into the spatial database as the GPS point trajectory database.

The main purpose of module B is to mine strong associated subnets. First, the road
segment trajectory database is constructed by transforming GPS point trajectories
$\{p_1, p_2, \ldots, p_n\}$ into road segment trajectories $\{r_1, r_2, \ldots, r_m\}$, where $p_i$ represents a
GPS point and $r_j$ represents a road segment. Then, the top-k most frequently traversed
road segments are selected as seeds, and these seeds are expanded into strong associated
subnets.

Module C is responsible for detecting traffic anomalies that occur in strong associated
subnets. The travel speed of road segments in strong associated subnets is monitored
and checked: if an extended low-speed situation occurs, the proposed system
regards it as a traffic anomaly and sends out a signal.


3.2 Pre-processing Module 


This module includes two parts: constructing a digital map database and map-matching 
for setting up a GPS point trajectory database. 

The process of constructing a digital map database can be summarized as the 
following steps: 


Step 1: Install the PostGIS extension for the PostgreSQL object-relational database.

Step 2: Download digital map data from OpenStreetMap using QGIS tools.

Step 3: Import the digital map data into the spatial database using the ogr2ogr tool.

Step 4: Install the pgRouting extension for PostgreSQL and use it to split roads into
road segments.


Setting up the GPS point trajectory database takes three steps:


Step 1: Remove redundant records from the large text files and sort them by taxi ID
and timestamp.

Step 2: Modify the source code of the Open Source Routing Machine (OSRM) map-matching
server, then compile and deploy it.

Step 3: Map-match every taxi GPS point, then import the observed and corrected
GPS data together into the spatial database.
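As an illustration of Step 1, the following is a minimal Python sketch of removing duplicate records and sorting by taxi ID and timestamp. The record layout (taxi_id,timestamp,longitude,latitude) and the function name are assumptions made for this example; the paper does not specify the file format. The sketch works in memory, so a file larger than 11 GB would in practice need an external sort or a database.

```python
import csv
import io


def dedup_and_sort(lines):
    """Drop exact duplicate records, then order them by (taxi ID, timestamp).

    `lines` is any iterable of CSV text lines in the assumed layout
    taxi_id,timestamp,longitude,latitude.
    """
    seen = set()
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or truncated lines
        record = (row[0], int(row[1]), float(row[2]), float(row[3]))
        seen.add(record)  # the set keeps one copy of each identical record
    return sorted(seen, key=lambda rec: (rec[0], rec[1]))


raw = io.StringIO(
    "T2,1400000020,113.57,22.27\n"
    "T1,1400000010,113.55,22.25\n"
    "T1,1400000000,113.54,22.24\n"
    "T1,1400000010,113.55,22.25\n"  # exact duplicate of the second line
)
records = dedup_and_sort(raw)
```

Here `records` holds three entries, ordered first by taxi ID and then by timestamp.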


3.3 Strong Associated Subnet Mining Module 


This module, combined with the digital map database, expands the top-k most frequent road
segments into strong associated subnets, as summarized by the following steps.


Step 1: Load a corrected GPS point trajectory $point_{tr} = \{p_1, p_2, \ldots, p_n\}$. For every
point $p_i$ of this trajectory, determine the road segment $r_j$ on which $p_i$ is located;
this yields the corresponding road segment trajectory
$roadseg_{tr} = \{r_1, r_2, \ldots, r_m\}$. At the same time, update the trajectory count
of each $r_j$.

Step 2: Save the road segment trajectories to the road segment trajectory database.

Step 3: According to the number of trajectories of every road segment, select the
top-k most frequent road segments as seeds.

Step 4: Expand the seeds into strong associated subnets. Finally, these strong associated
subnets are used as association rules to guide the alarm system.
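Steps 1 to 3 can be sketched in Python as follows. The sketch assumes that map-matching has already labeled every GPS point with a road segment ID, that consecutive points on the same segment collapse into one trajectory entry, and that a trajectory counts a segment at most once; these choices and the function names are illustrative assumptions, not details given in the paper.

```python
from collections import Counter


def to_segment_trajectory(matched_segments):
    """Turn per-point segment IDs into a road segment trajectory.

    Consecutive points matched to the same segment produce a single entry.
    """
    trajectory = []
    for seg in matched_segments:
        if not trajectory or trajectory[-1] != seg:
            trajectory.append(seg)
    return trajectory


def top_k_seeds(segment_trajectories, k):
    """Count how many trajectories contain each segment and return the top k."""
    counts = Counter()
    for traj in segment_trajectories:
        counts.update(set(traj))  # each trajectory counts a segment once
    seeds = [seg for seg, _ in counts.most_common(k)]
    return seeds, counts


point_level = [["a", "a", "b", "c"], ["b", "c", "c"], ["c", "d"]]
seg_trajs = [to_segment_trajectory(t) for t in point_level]
seeds, counts = top_k_seeds(seg_trajs, k=2)
```

In this toy input segment `c` appears in three trajectories and `b` in two, so they become the two seeds.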


3.4 Traffic Anomaly Detection Module 


We monitor only the road segments contained in strong associated subnets, and call
them interesting road segments. Detection can be summarized in three steps.


Step 1: Select the corrected GPS point trajectories that pass through interesting
road segments from the GPS point trajectory database.

Step 2: Fill in the timestamp of every point of each selected trajectory from the
corresponding observed GPS point trajectory.

Step 3: For every interesting road segment, we now have the trajectories that pass through it.
These tell us how long each taxicab takes to pass through the segment and the
distance it travels. With this knowledge, we can calculate and monitor the travel
speed of every interesting road segment and check for a successive low-speed
situation.
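The speed calculation in Step 3 can be sketched as below. The sketch assumes that each taxi traversal of a segment is a time-ordered list of (timestamp in seconds, x in meters, y in meters) points in a projected coordinate system, and that the segment speed is total distance divided by total time over all traversals; the paper does not state the exact averaging rule, so this is one plausible choice.

```python
import math


def traversal_distance_time(points):
    """Return (distance in m, elapsed time in s) of one taxi on one segment.

    `points` is a time-ordered list of (t_seconds, x_m, y_m) tuples.
    """
    dist = sum(math.dist(a[1:], b[1:]) for a, b in zip(points, points[1:]))
    elapsed = points[-1][0] - points[0][0]
    return dist, elapsed


def segment_speed(traversals):
    """Average travel speed in m/s, or None when no usable data exists."""
    total_d = 0.0
    total_t = 0.0
    for pts in traversals:
        if len(pts) < 2:
            continue  # a single point gives neither distance nor duration
        d, t = traversal_distance_time(pts)
        total_d += d
        total_t += t
    return total_d / total_t if total_t > 0 else None


taxi_1 = [(0, 0.0, 0.0), (10, 30.0, 40.0)]                   # 50 m in 10 s
taxi_2 = [(5, 0.0, 0.0), (15, 0.0, 50.0), (25, 0.0, 100.0)]  # 100 m in 20 s
speed = segment_speed([taxi_1, taxi_2])                      # 150 m / 30 s
```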


4 Strong Associated Subnet Mining and Traffic Anomaly Detection 


In this section, we introduce the two key algorithms of the traffic paralysis alarm system
based on strong associated subnets: the strong associated subnet mining algorithm and the
traffic anomaly detection algorithm.


4.1 Strong Associated Subnet Mining Algorithm 


We first define six operators used by the strong associated subnet mining algorithm.


Preliminaries 
Definition 1 (cntRoadseg): $\mathrm{cntRoadseg}(r_x)$ is the number of road segment trajectories
that contain road segment $r_x$.

Definition 2 (cntShared): $\mathrm{cntShared}(r_x, r_y)$ is the number of road segment
trajectories that contain both road segments $r_x$ and $r_y$. It quantifies the
association between $r_x$ and $r_y$: under the same conditions, the larger the
value of this operator, the stronger the association between $r_x$ and $r_y$.

Definition 3 (cntU): $\mathrm{cntU}(r_x, r_y)$ is the number of road segment trajectories
that contain at least one of the road segments $r_x$ and $r_y$.

Definition 4 (support): this operator is defined as follows:


$$\mathrm{support}(r_x, r_y) = \frac{\mathrm{cntShared}(r_x, r_y)}{\mathrm{cntU}(r_x, r_y)} \tag{1}$$


In the process of mining a strong associated subnet, we check whether a road
segment $r_x$ that is not in the strong associated subnet can become part of it.
If road segment $r_y$ is adjacent to $r_x$ and is part of the strong associated subnet,
$\mathrm{support}(r_x, r_y)$ is the first quantity we check. It must satisfy the following:


$$\mathrm{support}(r_x, r_y) > \lambda \tag{2}$$


where $\lambda$ is a predefined threshold that will be analyzed in Sect. 5.1.
Definition 5 (correlation): $\mathrm{correlation}(r_x, r_y)$ represents the correlation between road
segments $r_x$ and $r_y$. This operator is defined as follows:


$$\mathrm{correlation}(r_x, r_y) = \frac{\mathrm{cntShared}(r_x, r_y)}{2} \times \left( \frac{1}{\mathrm{cntRoadseg}(r_x)} + \frac{1}{\mathrm{cntRoadseg}(r_y)} \right) \tag{3}$$
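The operators of Definitions 1 to 5 can be written out directly. The sketch below assumes that each road segment trajectory is represented as a Python set of segment IDs and reads cntU as the number of trajectories containing at least one of the two segments; the function names are illustrative.

```python
def cnt_roadseg(trajs, rx):
    """Number of trajectories that contain segment rx."""
    return sum(1 for t in trajs if rx in t)


def cnt_shared(trajs, rx, ry):
    """Number of trajectories that contain both rx and ry."""
    return sum(1 for t in trajs if rx in t and ry in t)


def cnt_u(trajs, rx, ry):
    """Number of trajectories that contain rx or ry (or both)."""
    return sum(1 for t in trajs if rx in t or ry in t)


def support(trajs, rx, ry):
    """Eq. (1): shared count over union count; 0.0 if neither segment occurs."""
    union = cnt_u(trajs, rx, ry)
    return cnt_shared(trajs, rx, ry) / union if union else 0.0


def correlation(trajs, rx, ry):
    """Eq. (3): half the shared count times the sum of inverse segment counts."""
    nx, ny = cnt_roadseg(trajs, rx), cnt_roadseg(trajs, ry)
    if nx == 0 or ny == 0:
        return 0.0
    return cnt_shared(trajs, rx, ry) / 2 * (1 / nx + 1 / ny)


trajs = [{"a", "b", "c"}, {"b", "c"}, {"c", "d"}, {"a"}]
s_bc = support(trajs, "b", "c")      # 2 shared / 3 in the union
c_bc = correlation(trajs, "b", "c")  # 2/2 * (1/2 + 1/3)
```

Written this way, support is the Jaccard coefficient of the two trajectory sets, and correlation is the mean of the two conditional frequencies cntShared/cntRoadseg.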


Definition 6 (cluster): The cluster degree of subnet V is the average of the correlation
among the adjacent road segments in subnet V. This operator is defined as
follows:


Algorithm 1: cluster(r, V)
Output: the cluster degree of subnet V after r is added into it

clus = 0.0;
for r_x in V loop
    listRx = loadAdjacent(r_x, V);
    for r_y in listRx loop
        clus += correlation(r_x, r_y);
    end loop;
end loop;
clus /= 2;
listR = loadAdjacent(r, V);
for r_z in listR loop
    clus += correlation(r, r_z);
end loop;
return clus / (V.size + 1);
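A runnable Python counterpart of Algorithm 1 is sketched below. It assumes that road adjacency is given as a symmetric dictionary and that the correlation operator is passed in as a function; both are assumptions of this example. The halving step reflects that every adjacent pair inside V is visited once from each end.

```python
def cluster(r, subnet, adjacency, correlation):
    """Cluster degree of `subnet` after segment `r` is added to it.

    adjacency:   {segment: iterable of adjacent segments}, assumed symmetric
    correlation: function (rx, ry) -> float, e.g. Eq. (3)
    """
    members = set(subnet)
    clus = 0.0
    for rx in members:
        for ry in adjacency.get(rx, ()):
            if ry in members:  # only adjacent segments inside the subnet
                clus += correlation(rx, ry)
    clus /= 2  # each internal pair was counted from both of its ends
    for rz in adjacency.get(r, ()):
        if rz in members:
            clus += correlation(r, rz)
    return clus / (len(members) + 1)


adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
table = {frozenset(("a", "b")): 0.6, frozenset(("b", "c")): 0.9}
corr = lambda x, y: table[frozenset((x, y))]
degree = cluster("c", {"a", "b"}, adjacency, corr)  # (0.6 + 0.9) / 3
```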


In Algorithm 1, loadAdjacent(r, V) loads the road segments of subnet V that are
adjacent to r, and cluster(r, V) is the cluster degree of the hypothetical subnet made up of
subnet V and road segment r. The second thing we check in the process of mining a strong
associated subnet is that the cluster degree of the strong associated subnet satisfies the
following:


$$\mathrm{cluster}(r, V) > \mu \tag{4}$$


where $\mu$ is a predefined threshold that will be analyzed in Sect. 5.1.

Details of Algorithm

The strong associated subnet mining algorithm consists of two main steps. First, the
top-k most frequent road segments are selected as strong associated subnet seeds. Second,
we check whether there is a road segment in a strong associated subnet that is adjacent to one
or more road segments which are not yet in the subnet and which meet the thresholds
for expansion. If so, we call it a scalable road segment and add
the adjacent segments that meet the thresholds to the subnet. This process is repeated until
no scalable road segment remains. More precisely, the algorithm proceeds
as follows:


Algorithm 2: SASM(minSup, minCls, maxNetsize)

generate seeds and insert them into table Subnets
do
    r, netid = loadScalable();
    do
        r_i = loadAdjacent(r, netid);
        if the size of netid >= maxNetsize then
            break;
        end if;
        if r_i is already in netid then
            continue;
        end if;
        if support(r, r_i) < minSup then
            continue;
        end if;
        if cluster(r_i, netid) < minCls then
            continue;
        end if;
        insert r_i into Subnets;
        i++;
    while (i < number of adjacent road segments of r)
while (there is still a scalable road segment)
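The expansion loop of Algorithm 2 can be sketched in Python as follows. This is a simplified reading, not the authors' implementation: subnets are kept in memory instead of a database table, each subnet grows independently from its seed, every member is examined once as a scalable segment, and the support and cluster operators are supplied as functions. All names are assumptions of this example.

```python
def sasm(seeds, adjacency, support, cluster, min_sup, min_cls, max_netsize):
    """Expand each seed into a subnet by repeatedly absorbing adjacent segments.

    adjacency: {segment: iterable of adjacent segments}
    support:   function (r, ri) -> float
    cluster:   function (ri, subnet) -> float, cluster degree after adding ri
    """
    subnets = {}
    for netid, seed in enumerate(seeds):
        net = {seed}
        frontier = [seed]  # members whose neighbours are still unexamined
        while frontier and len(net) < max_netsize:
            r = frontier.pop(0)
            for ri in adjacency.get(r, ()):
                if len(net) >= max_netsize:
                    break
                if ri in net:
                    continue
                if support(r, ri) < min_sup:
                    continue
                if cluster(ri, net) < min_cls:
                    continue
                net.add(ri)
                frontier.append(ri)
        subnets[netid] = net
    return subnets


adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sup = {frozenset("ab"): 0.5, frozenset("bc"): 0.4, frozenset("cd"): 0.1}
support_fn = lambda x, y: sup[frozenset((x, y))]
cluster_fn = lambda ri, net: 1.0  # constant stand-in for Algorithm 1
result = sasm(["a"], adjacency, support_fn, cluster_fn, 0.3, 0.5, 10)
```

In the toy network the chain a, b, c is absorbed, while d is rejected because its support with c is below the threshold.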


4.2 Traffic Anomaly Detection Algorithm 


In this section, we use a flow chart (Fig. 3) to present our successive low-speed based traffic
anomaly detection algorithm. The proposed algorithm takes 30 s as a calculation cycle
and computes the average travel speed of every road segment in the strong associated subnets
in every cycle. Once started, the algorithm runs continuously. It iteratively computes travel
time and checks whether a traffic anomaly has occurred.
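The successive low-speed check can be sketched as a small state machine that is fed once per 30 s cycle. The speed threshold, the number of consecutive cycles that triggers an alarm, and the rule that a cycle without data resets the count are assumptions of this example; the paper does not give these values.

```python
class LowSpeedDetector:
    """Flag segments whose average speed stays low for several cycles in a row."""

    def __init__(self, speed_threshold, min_cycles):
        self.speed_threshold = speed_threshold  # m/s, assumed unit
        self.min_cycles = min_cycles            # consecutive low cycles needed
        self.streak = {}                        # segment -> current low streak

    def update(self, cycle_speeds):
        """Process one cycle; return the set of segments currently in alarm.

        cycle_speeds: {segment: average speed, or None if no taxi was seen}
        """
        alarms = set()
        for seg, speed in cycle_speeds.items():
            if speed is not None and speed < self.speed_threshold:
                self.streak[seg] = self.streak.get(seg, 0) + 1
            else:
                self.streak[seg] = 0  # normal speed or no data ends the streak
            if self.streak[seg] >= self.min_cycles:
                alarms.add(seg)
        return alarms


detector = LowSpeedDetector(speed_threshold=3.0, min_cycles=3)
history = [
    detector.update({"s1": 2.0, "s2": 2.0}),
    detector.update({"s1": 2.5, "s2": 5.0}),
    detector.update({"s1": 1.0, "s2": 2.0}),
]
```

Segment `s1` is flagged in the third cycle, while `s2` is not because its streak is interrupted.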


5 Experiment and Analysis


In this section, we first analyze two parameters, $\lambda$ and $\mu$, of the strong associated subnet
mining algorithm. Then, a simulation experiment based on VISSIM is carried out to
illustrate the principle of the traffic paralysis alarm system based on strong associated subnets
(Fig. 3).

[Flow chart. Recoverable labels: a road segment of a strong associated subnet, recorded as roadseg; the current timestamp, recorded as ts in seconds; initial variables idx = 0 and start = 1; a time window ending at ts - end*30; the trajectories that pass through roadseg in the window tseg; a test of whether aveV meets the threshold.]

Fig. 3. A successive low-speed based traffic anomaly detection algorithm


5.1 Strong Associated Subnet Mining Algorithm Parameter Analysis 


During system testing, we found that the scores of the operators support and
correlation are low most of the time, and that a low correlation score directly
causes the cluster score to be low as well. Therefore, we performed an analysis
based on the GPS data of Zhuhai and concluded that the lower the sampling
rate, the lower these operators' scores.

The original GPS data of Zhuhai is sampled every 10 s. We manually thinned the
original data to obtain another three datasets with sampling intervals of 20 s, 40 s, and
80 s. For each of the top-200 most frequent road segments, we separately calculated the
average score of the operator support on each of the four datasets. The statistical
distribution of the average score of operator support is shown in Fig. 4. Similarly, the
statistical distribution of the average score of operator correlation is shown in Fig. 5.

Fig. 4. Distribution of score of operator support

Fig. 5. Distribution of score of operator correlation


The two figures above show that the higher the sampling rate, the further to the right
the distribution peaks lie.


5.2 Simulation Experiment Analysis 


The idea of our proposed system is that when a traffic anomaly is detected in a strong
associated subnet, an alarm is raised immediately for this subnet to reduce the traffic entering
it and so avoid the transition from traffic jam to traffic paralysis. We want to
discover whether reducing the traffic entering the subnet helps avoid this transition.
We performed an experiment with VISSIM, and the result can be summarized as follows: when a
traffic anomaly occurs in a strong associated subnet, reducing the traffic entering the
subnet can decrease the influence of the traffic anomaly.

We simulated a traffic incident that occurs at the 300th second and is resolved at the
1500th second, and collected a delay time dataset and a queue length dataset under different
traffic volumes, shown in Figs. 6 and 7.

Fig. 6. Delay time under different traffic volumes

Fig. 7. Queue length under different traffic volumes

The delay time and queue length increase markedly as the traffic
volume rises, and they clearly show the traffic anomaly's influence.
Therefore, we arrive at the conclusion stated above: reducing the traffic entering
the subnet can decrease the influence of the traffic anomaly.


6 Conclusion and Future Work 


In this paper, we propose a traffic paralysis alarm system based on strong associated
subnets. We design and implement a strong associated subnet mining
algorithm and a traffic anomaly detection algorithm based on the taxicab GPS data of
Zhuhai. The former is used to mine strong associated subnets from trajectories on the road
network, and the latter periodically calculates the travel speed of interesting road segments
and checks whether a successive low-speed situation exists on them:
if one exists, the algorithm sends out a signal.

Two experiments are carried out in this paper. The first examines the low scores
of the operators and concludes that the lower the sampling rate, the lower these
operators' scores. The second uses VISSIM to simulate traffic situations when
a traffic incident occurs. By analyzing delay time and queue length under different
traffic volumes, we conclude that reducing the traffic entering a subnet can decrease the
influence of a traffic anomaly.

In future work, the proposed system should be improved to handle real-time GPS data
streams; how to store, pre-process, and analyze such data streams needs further
research.


Abstract. The diversified demands of different users are becoming a focus
for improving network performance. Traditional networks cannot meet these demands,
so we turn to the Software-Defined Network architecture, which realizes virtualization
and abstraction of underlying hardware resources via separation of the control
and data planes. First, we propose a virtual network mapping algorithm to allocate
link resources, using an ant colony algorithm to find the optimal solution. Then we
develop a virtual network fault recovery mechanism to satisfy the needs of end
users with different fault tolerance requirements. The mechanism is realized by a failure
recovery algorithm named NumMap, which provides varied network
reliability levels for users of varied priority. Finally, we conduct
simulation experiments to evaluate the algorithms with performance metrics such
as failure repairing ratio, success running ratio, and working link resource utilization.
The results demonstrate the superiority of the proposed algorithm compared
with the ResRemap and ResBackup algorithms.


Keywords: Virtual network · Software-Defined Network · Mapping algorithm


1 Introduction 


Network virtualization is a new technology that can address the problems of the Internet
and support the development of new technologies. An important advantage of this technique
is that different networks can share the underlying physical infrastructure. A virtual network
sits above the physical layer and is composed of virtual nodes and virtual links.
Different virtual networks may differ in network topology, in the technologies they use,
and in the services they provide.

The purpose of network virtualization is to facilitate configuration and management
and to facilitate the construction of new network technologies. Resource management
has become the key to realizing the advantages of network virtualization technology.
Resource management in a virtualized network environment should
rationally design the management structure and the resource scheduling algorithm to
realize efficient sharing of the underlying physical network resources and, while
satisfying the resource requests of virtual network users, greatly improve network
resource utilization.

Traditional network construction cannot meet the diverse needs of
different businesses. SDN was therefore proposed: it separates control from forwarding
to abstract the underlying hardware. SDN is a new network architecture
proposed by Stanford University and is the main application architecture of this paper.
With network virtualization technology, each experimental user can customize a virtual
network and construct a different topology for each virtual network. The underlying
physical network forms a network resource pool through node virtualization and link
virtualization and allocates resources to the virtual networks to improve network
resource utilization.

Built on switching equipment, SDN separates the data plane from the control plane
through a common standard interface, which makes it easier for researchers
to carry out technical research in a real network environment.
The experimenter can access the underlying physical
devices with a defined common protocol and programmatically control the forwarding
paths of data in the network. Network virtualization technology can divide the physical
network into multiple virtual networks. Each virtual network can run different
routing protocols and provide different services.

In practice, the physical network often fails because of natural
causes or non-natural causes such as malicious attacks, which affects the normal operation of
users' services. Previous mapping algorithms aim to enhance physical network resource
utilization and do not study network failure. To address the reliability problem of virtual
networks, one approach is a remapping mechanism for virtual nodes and virtual links, which not only
improves the acceptance rate of virtual networks but also benefits the load balance of the
network. Another approach migrates a failed virtual link to a backup path that is built in
advance. A third builds a unified backup resource pool that
dynamically allocates backup resources to virtual links and improves the utilization of
physical resources. These algorithms use different mechanisms to improve the reliability
of virtual networks. When backup paths are built in advance, the cost of reserving backup
resources is too high. When a new path is sought only after the fault occurs, the
failure repair rate is low. In this paper, we propose a fault recovery algorithm based on the number
of users for virtual network mapping, and compare its network performance with the
link remapping algorithm and the backup path construction algorithm.


2 Mapping Model Based on SDN


2.1 Network Description 


Virtual network requests are mapped onto the underlying physical network, which is made
up of different devices. The management layer creates a virtual network mapping request
based on the needs of the user and sends the request to the physical layer. Finally, the
virtual network provider provides the customized virtual network service to the user
(Table 1).

Table 1. Parameters of the mapping models

Symbol              Meaning
$N_p$               Physical node set
$E_p$               Physical link set
$N_v$               Virtual node set
$E_v$               Virtual link set
$N^{fwd}(n^p)$      Forwarding resources of physical node $n^p \in N_p$
$R^{fwd}(n^v)$      Forwarding resources required by virtual node $n^v \in N_v$
$N^{ctl}(g)$        Controller resources of domain $g$, $n^p \in g$
$R^{ctl}(n^v)$      The required control resources
$B(l^p)$            Bandwidth resources of each physical link $l^p \in E_p$
$R(l^v)$            The required bandwidth resources of each virtual link $l^v \in E_v$


2.2 Mapping Description 


Node mapping 

In an SDN network, the physical node must meet the forwarding resource requirements
of the virtual node, and the domain of the physical node must also meet the control
resource requirements of the virtual node. In addition, the classic one-to-one mapping
between virtual nodes and physical nodes is used. The remaining resources of the
physical node should be no less than the required forwarding resources, and the remaining
control resources in the domain should be no less than the required control resources.


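$$N^{fwd}(n_i^p) \ge R^{fwd}(n_j^v), \quad \sigma_{ij} = 1 \tag{1a}$$
$$N^{ctl}(g_k) \ge R^{ctl}(n_j^v), \quad \sigma_{ij} = 1,\ n_i^p \in g_k \tag{1b}$$
$$\sum_{n_i^p \in N_p} \sigma_{ij} = 1, \quad \forall n_j^v \in N_v \tag{1c}$$
$$\sum_{n_j^v \in N_v} \sigma_{ij} \le 1, \quad \forall n_i^p \in N_p \tag{1d}$$

Here $\sigma_{ij} = 1$ if virtual node $n_j^v$ is mapped to physical node $n_i^p$, and $\sigma_{ij} = 0$ otherwise.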
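The node mapping constraints can be checked mechanically. The sketch below assumes that a candidate mapping is given as a dictionary from virtual nodes to physical nodes, that each physical node belongs to one controller domain, and that the control constraint is tested per virtual node as in (1b); the data layout and names are assumptions of this example.

```python
def node_mapping_feasible(mapping, fwd_free, fwd_req, ctl_free, ctl_req, domain_of):
    """Check a virtual-to-physical node mapping against constraints (1a)-(1d).

    mapping:   {virtual node: physical node}
    fwd_free:  {physical node: remaining forwarding resources}
    fwd_req:   {virtual node: required forwarding resources}
    ctl_free:  {domain: remaining control resources}
    ctl_req:   {virtual node: required control resources}
    domain_of: {physical node: domain}
    """
    if set(mapping) != set(fwd_req):
        return False  # (1c) every virtual node must be mapped exactly once
    if len(set(mapping.values())) != len(mapping):
        return False  # (1d) a physical node hosts at most one virtual node
    for nv, phys in mapping.items():
        if fwd_free[phys] < fwd_req[nv]:
            return False  # (1a) forwarding resources are insufficient
        if ctl_free[domain_of[phys]] < ctl_req[nv]:
            return False  # (1b) control resources of the domain are insufficient
    return True


fwd_free = {"p1": 10, "p2": 5}
fwd_req = {"v1": 4, "v2": 6}
ctl_free = {"g1": 3}
ctl_req = {"v1": 1, "v2": 2}
domain_of = {"p1": "g1", "p2": "g1"}
ok = node_mapping_feasible({"v1": "p2", "v2": "p1"},
                           fwd_free, fwd_req, ctl_free, ctl_req, domain_of)
```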


Link mapping 

Find all nodes that meet the requirements, and select one of them as an endpoint.
Use the Dijkstra algorithm to find the shortest path that meets
the bandwidth requirement. Finally, link mapping is implemented.


$$B(l^p) > R(l^v), \quad \forall l^p \in Ph_{SD},\ Ph_{SD} = M(l^v) \tag{2a}$$


$$B(l^p) = B(l^p) - \sum_{l^v :\, l^p \in M(l^v)} R(l^v) \tag{2b}$$
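Link mapping with a bandwidth constraint can be sketched with a standard Dijkstra search that ignores links whose remaining bandwidth is insufficient. The sketch assumes undirected links, hop count as path length, and the strict inequality as printed in (2a); these choices and the function name are assumptions of this example.

```python
import heapq


def map_link(residual_bw, src, dst, demand):
    """Find a shortest feasible path and reserve `demand` on each of its links.

    residual_bw: {(u, v): remaining bandwidth}, one key per undirected link;
                 the dictionary is updated in place as in (2b).
    Returns the path as a list of nodes, or None if no feasible path exists.
    """
    adj = {}
    for (u, v), bw in residual_bw.items():
        if bw > demand:  # constraint (2a)
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj.get(u, ()):
            nd = d + 1  # every link has unit length in this sketch
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    for u, v in zip(path, path[1:]):
        key = (u, v) if (u, v) in residual_bw else (v, u)
        residual_bw[key] -= demand  # update (2b)
    return path


links = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 3}
first = map_link(links, "A", "C", demand=5)   # direct link is too small
second = map_link(links, "A", "C", demand=5)  # detour now has only 5 left
```

The first request takes the detour through B because the direct link has too little bandwidth; the second request fails because the detour no longer has strictly more than the demand.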


2.3 Objective Function


To improve the efficiency of physical network resource usage, that is, to use as
few physical resources as possible to carry as many virtual networks as possible, this paper
takes $S(G_p)$, the residual resource value of the underlying network, as the objective function.


$$S(G_p) = \alpha \left[ \sum_{n_i^p \in N_p} N^{fwd}(n_i^p) + \sum_{g_k} N^{ctl}(g_k) \right] + \beta \sum_{l^p \in E_p} B(l^p) \tag{3}$$


$N^{fwd}(n_i^p)$ represents the remaining forwarding resources of the node, $N^{ctl}(g_k)$ is the
remaining control resource of the domain, $B(l^p)$ represents the remaining bandwidth
resources of the link, and $\alpha$ and $\beta$ are the resource conversion weights between node CPU
resources and link bandwidth. $S(G_p)$ denotes the physical resources available, which is in
direct proportion to the number of virtual networks supported. Therefore, the optimization
target is set to $\max\{S(G_p)\}$.
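The objective value can be computed as a weighted sum of the remaining resources. The sketch below assumes that the remaining forwarding, control, and bandwidth resources are held in dictionaries and that the sums run over all physical nodes, domains, and links; the weights in the example are arbitrary.

```python
def residual_value(node_fwd, domain_ctl, link_bw, alpha=1.0, beta=1.0):
    """Residual resource value of the physical network, used as the objective.

    node_fwd:   {physical node: remaining forwarding resources}
    domain_ctl: {domain: remaining control resources}
    link_bw:    {link: remaining bandwidth}
    alpha, beta: conversion weights between node and link resources
    """
    node_part = sum(node_fwd.values()) + sum(domain_ctl.values())
    link_part = sum(link_bw.values())
    return alpha * node_part + beta * link_part


value = residual_value(
    node_fwd={"p1": 10, "p2": 20},
    domain_ctl={"g1": 5},
    link_bw={("p1", "p2"): 8},
    alpha=0.5,
    beta=2.0,
)  # 0.5 * (30 + 5) + 2.0 * 8
```

A mapping that leaves more resources free yields a larger value, so candidate solutions can be ranked by this number.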


3 Virtual Network Algorithm 


In this paper, the ant colony algorithm is used to solve the virtual network mapping 
problem in the software defined network model. 


The Algorithm of Virtual Network

Initialization();
VN_Creation();
For i = 1 to N
{
    Update_probability();
    For ant = 1 to M
    {
        Node_Map();
        Link_Map();
        if (S(local_solution) > S(global_solution))
            global_solution = local_solution;
    }
    Update_info();
}

According to the above algorithm, in each iteration the feasible solution with the largest
residual network resource value $S(G_p)$ is selected as the optimal solution of the current
cycle. The pheromone from virtual nodes to physical nodes is then updated according to this
optimal solution. After N iterations, the M feasible solutions
generated by the initialization eventually converge to an approximate global
optimal solution.


The pheromone on physical node $n_i^p$ is updated as follows:

$$\tau_i(t) = \rho\,\tau_i(t-1) + \Delta \tag{4}$$

where $\rho$ is the persistence of the pheromone and $\Delta$ is the increment of the pheromone.


4 
AS S(pe*) ? (5) 


$p^{best}$ is the optimal solution of this cycle, $S(p^{best})$ is the objective function value of
the optimal solution, and $\phi$ is the influence factor of the optimal solution on the pheromone.
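The update in Eq. (4) can be sketched as follows. The increment is passed in as a plain number because its exact formula is not reproduced here, and depositing it only on the physical nodes used by the best solution of the cycle is a common ant colony convention assumed for this example.

```python
def update_pheromone(tau, best_nodes, rho, delta):
    """Apply tau_i(t) = rho * tau_i(t-1) + delta to the pheromone table.

    tau:        {physical node: pheromone level at step t-1}
    best_nodes: physical nodes used by the best solution of this cycle
    rho:        persistence of the pheromone, between 0 and 1
    delta:      pheromone increment for nodes in the best solution (assumed
                to be zero for all other nodes)
    """
    return {
        node: rho * level + (delta if node in best_nodes else 0.0)
        for node, level in tau.items()
    }


tau = {"p1": 1.0, "p2": 2.0}
new_tau = update_pheromone(tau, best_nodes={"p1"}, rho=0.5, delta=0.3)
```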


4 Fault Recovery Mechanism 


4.1 Fault Recovery Algorithm Based on User Number (NumMap)


This paper presents a new fault recovery mechanism based on the preceding network
mapping model. The number of users differs from period to period, so we
choose the strategy according to the number of users. When the number of users is
low, there are more free link resources, so we use the backup path construction algorithm
for most users to ensure the reliability of the network as far as possible; the remaining
users use the faulty link remapping algorithm when a failure occurs. When the number
of users is large, there are fewer free link resources, so backup paths are reserved for only
a small number of users, and most users affected by a failure use the faulty link
remapping algorithm. For each time slice, we update the strategy based on the
number of users by changing the proportion of pre-backup users to normal users. In this
paper, we assume that failures follow a uniform distribution.

The free link resources vary with the number of users. In the proposed fault recovery
algorithm based on the number of users, the traffic of the pre-backup users
affected by a fault is migrated to the backup path, and the faulty links of common users
are remapped. The algorithm can improve the fault repair rate and the network resource
utilization rate.


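Mapping rules
Node mapping

$$N^{fwd}(n_i^p) \ge R^{fwd}(n_j^v), \quad \sigma_{ij} = 1 \tag{6a}$$
$$N^{ctl}(g_k) \ge R^{ctl}(n_j^v), \quad \sigma_{ij} = 1,\ n_i^p \in g_k \tag{6b}$$
$$\sum_{n_i^p \in N_p} \sigma_{ij} = 1, \quad \forall n_j^v \in N_v \tag{6c}$$
$$\sum_{n_j^v \in N_v} \sigma_{ij} \le 1, \quad \forall n_i^p \in N_p \tag{6d}$$

Link mapping
For low level users, it suffices to meet

$$B(l^p) > R(l^v), \quad \forall l^p \in Ph_{SD},\ Ph_{SD} = M(l^v) \tag{7a}$$
$$B(l^p) = B(l^p) - \sum_{l^v :\, l^p \in M(l^v)} R(l^v) \tag{7b}$$


For pre-backup users, link mapping is required not only to meet the bandwidth
requirements, but also to ensure that the working path does not overlap with the backup path.


$$B(l^p) > R(l^v), \quad \forall l^p \in Ph_{SD} \tag{8a}$$


$$Ph_{work} \cap Ph_{backup} = \emptyset \tag{8b}$$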


Input: $l_{SD}$, $W_r(l_{SD})$ 
Output: $P_{SD}$ 

(1) Determine the level of the virtual network carried on the faulty link; 
(2) If it is a high-level user, check the availability of its backup path; if it is a low-level user, skip to (4); 
(3) Determine whether the backup path is available. If it is available, use it as $P_{SD}$; otherwise set $P_{SD}$ to NULL. Skip to (7); 
(4) Remove the faulty link $l_{SD}$; 
(5) Use the shortest path algorithm to find an alternative path $P_{SD}$ that has the same endpoints as the faulty path. If none is found, set $P_{SD}$ to NULL and skip to (7); 
(6) $B_r(l) = B_r(l) - W_r(l_{SD})$: update the remaining bandwidth resources of the link; 
(7) If $P_{SD}$ is NULL, the virtual network recovery has failed. Otherwise, return the recovery path. 

The algorithm is run in turn for every virtual network affected by the failure, after which the repair is complete. 
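The seven steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function and variable names are hypothetical, the "shortest path" is taken to be a fewest-hop breadth-first search over links with enough remaining bandwidth, and a backup path is treated as available when it does not contain the faulty link.

```python
from collections import deque


def path_links(path):
    """Return the undirected links (as frozensets) along a node path."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def shortest_path(adj, src, dst, demand, bandwidth):
    """Fewest-hop path using only links with enough remaining bandwidth."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent and bandwidth[frozenset((node, nxt))] >= demand:
                parent[nxt] = node
                queue.append(nxt)
    return None


def recover(adj, bandwidth, faulty, demand, high_level, backup=None):
    """Return a recovery path for one virtual link, or None if recovery fails."""
    bad = frozenset(faulty)
    if high_level:
        # Steps (2)-(3): a pre-backup user relies on its reserved backup path.
        # Assumption: the path is "available" if it avoids the faulty link.
        usable = backup is not None and bad not in path_links(backup)
        return backup if usable else None
    # Step (4): remove the faulty link from the topology.
    pruned = {u: [v for v in vs if frozenset((u, v)) != bad] for u, vs in adj.items()}
    # Step (5): look for an alternative path between the same endpoints.
    path = shortest_path(pruned, faulty[0], faulty[1], demand, bandwidth)
    if path is None:
        return None
    # Step (6): update the remaining bandwidth of every link on the new path.
    for link in path_links(path):
        bandwidth[link] -= demand
    return path
```

Calling `recover` once per affected virtual network mirrors the final remark that the algorithm is run for every virtual network hit by the failure.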


4.2 ResBackup Algorithm 


The ResBackup algorithm finds all backup paths in advance and enables a backup path when the network fails. If the backup path is available, it is returned; otherwise, the virtual network recovery fails. This algorithm has a high recovery rate, but the backup paths occupy too many resources, which reduces network utilization and degrades network performance. 

S and D denote the two endpoints of the link; $W_r(l_{SD})$ is the bandwidth occupied by the backup resource on the link; $G_s$ is the set of backup links. The algorithm is described as follows: 

Input: $l_{SD}$, $W_r(l_{SD})$, $G_s$ 
Output: $P_{SD}$ 

(1) Remove the faulty link from $G_s$; 
(2) Determine whether the backup path is available. If it is not available, set $P_{SD}$ to NULL and skip to (4); 
(3) $B_r(l) = B_r(l) - W_r(l_{SD})$: update the remaining bandwidth resources of the link; 
(4) If $P_{SD}$ is NULL, the virtual network recovery has failed. Otherwise, return the recovery path. 


4.3 ResRemap Algorithm 


The ResRemap algorithm repairs the network at the moment a link fails. When the network fails, it first removes the faulty link from the link set and then searches for a path with the same endpoints as the faulty link. If such a path is found, the resources are updated and the path is restored; otherwise, the virtual network recovery fails. 

Input: $l_{SD}$, $W_r(l_{SD})$, $G_s$ 
Output: $P_{SD}$ 

(1) Remove the faulty link from $G_s$; 
(2) Use the shortest path algorithm to find an alternative path $P_{SD}$ that has the same endpoints as the faulty path. If none is found, set $P_{SD}$ to NULL and skip to (4); 
(3) $B_r(l) = B_r(l) - W_r(l_{SD})$: update the remaining bandwidth resources of the link; 
(4) If $P_{SD}$ is NULL, the virtual network recovery has failed. Otherwise, return the recovery path. 


5 Simulation Parameters 


5.1 Simulation Parameters 


In this paper, we use MATLAB to generate the underlying physical network and the virtual network request topologies. The physical network has 25 nodes placed at random in a 200 × 200 space, and each pair of nodes is connected with probability 0.5 (Table 2). 

Table 2. Parameters configuration of models 

Parameter                             Configuration 
Node control resources                Uniform distribution over [50, 100] 
Link bandwidth resources              Uniform distribution over [50, 100] 
Virtual network request               Poisson process with intensity 4; the time unit is a time window 
Virtual network survival time         Exponential distribution with a mean of 10 time windows 
Number of virtual network nodes       Uniform distribution over [2, 10] 
Required control resources            Each request is subject to a uniform distribution over [0, 20] 
Required bandwidth resources          Each request is subject to a uniform distribution over [0, 20] 
Link failure                          Each request is subject to a uniform distribution over [0, 50] 
Node resource weight                  Set to 1 
Link resource weight                  Set to 1 
Number of ants                        70 
Number of iterations                  70 
The persistence of pheromones (ρ) 
The initial value of pheromone (τ) 





Fig. 1. Virtual network topology 
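The paper generates its topologies in MATLAB; the sketch below is only a Python analogue of the physical-network setup described above (25 nodes in a 200 × 200 space, pairwise link probability 0.5, node and link resources drawn uniformly from [50, 100]). The function name, field names and seed handling are illustrative assumptions.

```python
import random


def make_substrate(n_nodes=25, size=200.0, p_link=0.5, seed=0):
    """Generate a random physical network as (nodes, links) dictionaries."""
    rng = random.Random(seed)
    nodes = {
        i: {
            "pos": (rng.uniform(0, size), rng.uniform(0, size)),
            "control": rng.uniform(50, 100),  # node control resources
        }
        for i in range(n_nodes)
    }
    # Each unordered node pair (i, j) with i < j is linked with probability
    # p_link; the stored value is the link bandwidth.
    links = {
        (i, j): rng.uniform(50, 100)
        for i in range(n_nodes)
        for j in range(i + 1, n_nodes)
        if rng.random() < p_link
    }
    return nodes, links
```

Fixing the seed makes a run repeatable, which helps when the three recovery algorithms are compared on the same topology.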


5.2 Evaluation Standard 


In order to evaluate the efficiency and performance of the virtual network more conveniently, we define a set of metrics as the standard for evaluating the advantages and disadvantages of virtual network mapping algorithms (Fig. 1). 

(1) Acceptance ratio 

The acceptance ratio is the fraction of requests that are successfully mapped. It is an important criterion for evaluating the performance of the algorithm. $N_{acreq}$ is the number of accepted requests, $N_{alreq}$ is the number of all requests, and $R_{ac}$ is the acceptance ratio. 

$$R_{ac} = \frac{N_{acreq}}{N_{alreq}} \tag{9}$$

(2) Fault recovery rate 

$N_s$ is the number of failures that were successfully repaired, $N_f$ is the number of all failures, and $R_f$ is the fault recovery rate. 

$$R_f = \frac{N_s}{N_f} \tag{10}$$

(3) Link resource utilization 

$\eta_l$ represents the link resource utilization, $N_l$ is the normal link resources, and $N_{total}$ is the total resources of the links. 

$$\eta_l = \frac{N_l}{N_{total}} \tag{11}$$
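As a small illustration, the three ratios in (9)-(11) can be computed from simulation counters as below. The function names are hypothetical, and returning 0.0 for an empty denominator is a choice made here, not something the paper specifies.

```python
def ratio(part, total):
    """Return part / total, or 0.0 when nothing has been counted yet."""
    return part / total if total else 0.0


def acceptance_ratio(accepted_requests, all_requests):
    """Formula (9): accepted requests over all requests."""
    return ratio(accepted_requests, all_requests)


def fault_recovery_rate(repaired_failures, all_failures):
    """Formula (10): successfully repaired failures over all failures."""
    return ratio(repaired_failures, all_failures)


def link_utilization(normal_link_resources, total_link_resources):
    """Formula (11): normal link resources over total link resources."""
    return ratio(normal_link_resources, total_link_resources)
```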


6 Simulation Results 


6.1 Acceptance Ratio 


When γ = 0.1, the request intensity is low. Because the faulty link remapping algorithm (ResRemap) reserves no backup links, more network resources stay free and its acceptance ratio is the highest. The backup links of the ResBackup algorithm occupy many link resources, so its acceptance ratio is the lowest. The NumMap algorithm backs up the links of only some users and occupies part of the additional link resources, so its $R_{ac}$ lies between those of the other two algorithms. 

When γ = 0.5, there are more user requests. The acceptance ratio of the ResRemap algorithm is still the highest, but its $R_{ac}$ decreases from 67% to 55%. The ResBackup algorithm backs up links in advance and occupies many link resources, so its acceptance ratio does not change noticeably. The $R_{ac}$ of the NumMap algorithm is slightly lower than before, but still lies between those of the other two algorithms (Fig. 2). 

[Figure: acceptance ratio versus time in two panels, γ = 0.1 and γ = 0.5; curves for ResRemap, NumMap and ResBackup] 

Fig. 2. Acceptance rate for three algorithms 


6.2 Fault Recovery Rate 


When γ = 0.1, the fault repair rate of the virtual networks of high-level users is close to 100%, and that of low-level users is basically stable at 77%. The ResBackup algorithm has the highest repair rate, close to 100%, while the repair rate of the ResRemap algorithm is basically stable at 76%. When γ = 0.5, the number of users increases significantly, but the fault repair rate of high-level users is still about 95% and that of low-level users stabilizes at 60% or more. The fault repair rate of the ResBackup algorithm is higher, because it has already set up a backup path for every successfully mapped virtual network before a link failure occurs (Fig. 3). 


[Figure: fault recovery rate versus time in two panels, γ = 0.1 and γ = 0.5; curves for ResBackup, NumMap (high-level), ResRemap and NumMap (low-level)] 

Fig. 3. Fault recovery rate for three algorithms 


6.3 Link Resource Utilization 


The number of users is small when γ = 0.1. Because ResRemap starts to repair only when a link fails, no backup links seize resources, so its link resource utilization $\eta_l$ is the highest, reaching 50%. ResBackup occupies many link resources for backup links, so its $\eta_l$ is the lowest. The NumMap algorithm, which backs up the links of only part of the users according to the number of users, reaches 45%, between the two algorithms above. In the case of γ = 0.5, link resource utilization declines (Fig. 4). 





[Figure: link resource utilization versus time (0 to 40) in two panels, γ = 0.1 and γ = 0.5; curves for ResRemap, NumMap and ResBackup] 

Fig. 4. Link resource utilization for three algorithms 


7 Conclusion 


The virtualized network represented by SDN can meet the diversified business needs of different users, support multiple routing protocols, protect the security of user information, and promote the evolution of the traditional Internet architecture toward the next-generation architecture. In this research, assuming single link failure, which is the most common failure in the network, and considering the diversity of users' network reliability requirements, we designed a fault recovery algorithm based on the number of users. Finally, we demonstrated the advantages of the algorithm in terms of the virtual network fault repair rate, the acceptance ratio of virtual networks, and the utilization of link resources. 


Abstract. Localization is a pivotal technology in wireless sensor networks (WSNs), and the location information of sensor nodes is essential to location-based applications. In beacon-based localization, the reliability of the beacons' location information is critical to the quality of network service. In this paper, we study the influence of drifting beacons on network localization. For this scenario, we propose a distributed and lightweight beacon location verification algorithm based on neighborhood similarity (BLVNS), which uses the similarity of a beacon's neighborhood in different time slots to recognize drifting beacons. The algorithm can be applied to both static and dynamic WSNs to improve the accuracy of range-free localization. Experimental results show that the algorithm can recognize unreliable beacons with a detection rate higher than 90%. 


Keywords: WSNs · Reliable localization · Drifting beacons · Range-free localization 


1 Introduction 


Wireless sensor networks (WSNs), which consist of a large number of sensor nodes, have been widely used in military applications and daily life, e.g., surveillance, environmental monitoring, and medical health [1]. Localization is one of the most essential research issues in WSNs because sensed information without location is insignificant in scenarios such as environment monitoring, target tracking, and geographical routing [2]. To acquire the locations of sensor nodes, we can either equip nodes with GPS receivers or predefine the nodes' positions manually during deployment. Because of their relatively high price and energy consumption, GPS receivers may not be available for power-limited, low-cost WSNs, and the second method is not feasible for large-scale WSNs. As a result, only a small fraction of the nodes, called beacon nodes, have their locations predefined manually during deployment; the other nodes are called unknown nodes. Taking the locations of beacons as references, the normal sensor nodes can estimate their own locations using localization algorithms. These localization algorithms can be divided into range-based and range-free localization algorithms [3]. The former assume that the distances between sensors and beacons can be estimated by measurements such as TDoA, ToA, AoA and RSSI [4-7]. These algorithms usually provide higher positioning accuracy at a higher hardware cost. The latter estimate the locations of normal nodes based on features of the network, such as hop counts [8] and network topology [9]. 

In studies of localization in static WSNs, beacons' locations are usually assumed to be reliable. Nevertheless, in many WSN systems deployed in unstable environments, any node may be moved unexpectedly, and beacons may suffer hardware failures or be captured and made to provide false locations. Drifting beacons are called unreliable beacons. Drifting normal nodes can be re-localized by re-invoking the localization algorithm. But when the locations of unreliable beacons are used in the re-localization process, they seriously degrade the accuracy of re-localization and affect the quality of network service. Therefore, recognizing and filtering out these unreliable beacons has become one of the most important research issues in node localization for WSNs. 

To solve the problems mentioned above, we propose a distributed and lightweight beacon location verification algorithm based on neighborhood similarity, which uses the similarity of a beacon's neighborhood in different time slots to verify which beacons are drifting. 


2 Related Works 


At present, existing studies of reliable localization in WSNs are divided into robust localization algorithms [10] and unreliable beacon filtering algorithms [11]. The former are applicable in scenes with interference in the ranging information or small movements of beacons; their main idea is to reduce the effect of environmental interference on localization. The latter verify beacon locations to recognize and filter out unreliable beacons. As a result, unreliable beacon filtering algorithms are more universal and can be combined with localization algorithms to ensure localization accuracy. 

In [12], a point-to-point location verification algorithm was proposed that can be applied to range-based algorithms, but it requires the nodes to be equipped with GPS receivers. D.J. He et al. proposed a location verification algorithm based on ToA to eliminate abnormal range values, which can be applied to range-based localization algorithms [13]. Kuo et al. proposed the beacon movement detection (BMD) algorithm to detect unexpected movements of beacons [14]. The basic idea of BMD is to let the beacons record the variations of the RSSI measurements between each other and report them to a calculation center, which determines the moved beacons. Like all other centralized algorithms, BMD brings a heavy communication burden and needs a sink node or a computer with strong computing ability, which is not suitable for WSNs. Other related works use the locations of hidden checking nodes to verify the locations of beacons; they are also centralized and need additional nodes [15]. In [16], rigidity theory is used to exclude abnormal beacon locations and provide reliable localization results. However, rigidity theory relies on highly accurate range measurements, and the computation of the algorithm is too heavy. Ravi Garg et al. eliminated the beacons that produce a larger descent gradient during localization to improve the credibility of localization. The algorithm does not take the locations of normal nodes as references, so it can be used in beacon-sparse networks [17]. Yawen Wei et al. presented a probabilistic location verification model based on the mutual observation between neighbor nodes, which achieved a better result [18]. In [19], a distributed neighbor-node scoring mechanism based on the variation of RSSI was developed to identify the moved beacons. However, none of the above methods is suitable for range-free algorithms, and each addresses only a single problem. 


3 Problem Statement 


3.1 System Model and Assumptions 


We assume that two types of nodes are deployed in the network: beacons and sensors. The beacons know their locations in advance. The sensors, also called unknown nodes, do not know their locations, but they can estimate them by running localization algorithms with the beacons' locations. All nodes are equipped with RSSI transceivers. All nodes can be moved unexpectedly. Since drifting unknown nodes always exist, the network needs to be re-localized. 

Besides, we assume that the communication ranges of all nodes have the same radius. Note that environmental interference and perturbations exist, so neighborhood observation is not always symmetric; our algorithm can also handle this asymmetry. In the initial localization, we assume that the beacons are reliable and that every node can participate in the algorithm's calculation. During the calculation, nodes are static. In most cases, drifting nodes account for 10%-20% of all nodes, and unreliable beacons make up no more than 50% of the beacons [20]. For convenience, our notations are introduced in Table 1. 


Table 1. Notations 

$S_i$                    Node i 
$B_i$                    Beacon i 
$ID(i)_t$                At time t, the set of node i's neighbors 
$Nei(i, j)_t$            At time t, the common set of node i's and node j's neighbors 
$DisRank(i, j)_t$        At time t, the actual distance coefficient between node i and node j 
$EstDisRank(i, j)_t$     At time t, the estimated distance coefficient between node i and node j 
$EstDis(i, j)_t$         At time t, the estimated distance between node i and node j 
$IDSame(i)_{(t,t+1)}$    In (t, t + 1), the change between $ID(i)_t$ and $ID(i)_{t+1}$ 
$neiv(i)_t$              At time t, a vector that stores $DisRank(i, j)_t$, $j \in ID(i)_t$ 
$Relation(i, j)_t$       At time t, the relationship between node i and node j 


3.2 Unreliable Beacons Models 


First, two kinds of unreliable beacons are defined in this paper as follows: 
Drifting beacons: In some application scenarios, after the network has been deployed and localized, any node may be moved unexpectedly. Beacons among these drifting nodes are called drifting beacons. As an example, Fig. 1 shows a scenario where beacon B2 has been moved, so the location it broadcasts no longer matches its actual location. If the localization algorithm uses B2's location during re-localization, the accuracy of re-localization is degraded. 


[Figure: example network; legend entries are Beacon, Sensor, Drifted Beacon and Malicious Beacon; beacon B2 is among the labeled nodes] 

Fig. 1. Unreliable beacons models 


3.3 Problem Model 


Drifting beacons degrade the accuracy of re-localization. Before re-localizing the network, we should recognize and filter out the unreliable beacons. We assume that only a small part of the beacons are unreliable; their number is denoted by $h$. 

The set of unreliable beacons is $A_u = \{a_k : k = 1, \dots, h,\ h \ll m\}$. The set of beacons is $A_b$ and the number of beacons is $m$. $A'_u$ denotes the set of unreliable beacons identified by the proposed verification algorithm $g(\cdot)$, so $g(A_b) = A'_u$. In this paper, we design a location verification algorithm $g(\cdot)$ that makes $A'_u$ as close to $A_u$ as possible, i.e.: 

$$\min\, |A_u - A'_u|$$


4 DV-Hop Localization Algorithm and Hop Count Correction 


In this section, we describe the DV-Hop localization algorithm and how to correct hop 
count by using RSSI technology. 


4.1 DV-Hop Localization Algorithm 


Niculescu et al. [21] proposed the DV-Hop localization algorithm by utilizing distance 
vector routing mechanism. It has three phases [22]: 

In the first phase, the network employs a typical distance vector routing mechanism. Beacons flood their locations throughout the network with an initial hop-count of 0. Each node that relays the message increases the hop-count by one. After the flooding procedure, every node knows its minimum hop-count to every beacon. 

In the second phase, after obtaining the locations of and hop-counts to all other beacons, each beacon estimates the average distance per hop. $B_i$ calculates its average distance per hop, called $H_i$, using formula (1), where $(x_i, y_i)$ are the coordinates of $B_i$ and $h_{ij}$ is the hop-count from $B_i$ to $B_j$. Then $H_i$ is flooded to the sensors near $B_i$. 

$$H_i = \frac{\sum_{j \ne i} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}}{\sum_{j \ne i} h_{ij}} \tag{1}$$

In the last phase, before conducting self-localization, each sensor estimates its distance to each beacon from its hop-count to that beacon and the received $H$. After obtaining all the distance information, each sensor uses triangulation or maximum likelihood estimation to estimate its own location. 
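The second and third phases can be illustrated with a short Python sketch. The data layout (dictionaries keyed by beacon id) and the function names are assumptions made for this example, and the sensor is assumed to use a single received hop size for all beacons; the final position solve (triangulation or maximum likelihood) is left out.

```python
import math


def hop_size(i, positions, hops):
    """Average distance per hop H_i of beacon i, following formula (1)."""
    xi, yi = positions[i]
    total_dist = 0.0
    total_hops = 0
    for j, (xj, yj) in positions.items():
        if j == i:
            continue
        total_dist += math.hypot(xi - xj, yi - yj)
        total_hops += hops[i][j]
    return total_dist / total_hops


def estimate_distances(sensor_hops, received_hop_size):
    """Estimated sensor-to-beacon distances: hop-count times hop size."""
    return {b: h * received_hop_size for b, h in sensor_hops.items()}
```

For instance, with beacons at (0, 0), (30, 0) and (0, 40) that are 3 and 4 hops away from the first one, `hop_size` gives 70 / 7 = 10 for that beacon, so a sensor 2 hops away estimates a distance of 20.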


4.2 Hop Count Correction 


In this section, we use RSSI to correct the hop count between nodes so that it reflects the distance proximity between them. Many recent studies of the DV-Hop localization algorithm use RSSI to correct the hop count between nodes [22, 23]. In this paper, we use this idea to help nodes obtain the distance proximity to their neighbors. 

The coverage area of a WSN is large and the environments of different regions differ, but within a small area, such as the range of a node and its neighbors, the environment is relatively stable. Based on this assumption, we can use RSSI to correct the hop count. 

First, each node transmits its RSSI information to its neighbors, so each node obtains the RSSI values of its neighbors. Second, each node normalizes the RSSI values using formula (2). $Rssi(i, j)_t$ represents the RSSI value between $S_i$ and $S_j$, and $\min(Rssi(i, m)_t)$, $m \in ID(i)_t$, is the minimum of these RSSI values. Since RSSI values range from -95 dBm to -55 dBm, $rssi$ is a constant set to -50 dBm, which makes it convenient for the proposed algorithm to calculate similarity. Finally, each node obtains the distance coefficients to its neighbors. Besides, if $S_i$ cannot communicate with $S_j$, $DisRank(i, j)_t = 0$. 

$$DisRank(i, j)_t = \frac{Rssi(i, j)_t - rssi}{\min\left(Rssi(i, m)_t\right) - rssi}, \quad m \in ID(i)_t \tag{2}$$
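A minimal Python sketch of the normalization in formula (2) is given below. The function name and the dictionary layout are illustrative assumptions; nodes that cannot be heard are simply absent from the input, which corresponds to a coefficient of 0.

```python
RSSI_OFFSET_DBM = -50.0  # the constant "rssi" in formula (2)


def dis_rank(rssi_by_neighbor, offset=RSSI_OFFSET_DBM):
    """Distance coefficient of every neighbor of one node, formula (2).

    rssi_by_neighbor maps a neighbor id to the measured RSSI in dBm, which is
    expected to lie between -95 and -55, so (value - offset) is never zero.
    """
    if not rssi_by_neighbor:
        return {}
    weakest = min(rssi_by_neighbor.values())
    return {
        j: (value - offset) / (weakest - offset)
        for j, value in rssi_by_neighbor.items()
    }
```

With this normalization the neighbor with the weakest signal gets coefficient 1, and stronger (presumably closer) neighbors get smaller values; for example, readings of -95 dBm and -55 dBm map to 1 and 1/9.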


5 BLVNS Algorithm 


In this section, we describe the beacon location verification algorithm based on neighborhood similarity. BLVNS uses the similarity of a beacon's neighborhood in different time slots to recognize drifting beacons. The neighborhood is characterized by two aspects: the set of neighbors and the distances between the node and its neighbors. Besides, BLVNS can effectively minimize the influence of drifting neighbors. 


5.1 Neighborhood Relationship 
(1) Set of neighbors 


All nodes may be moved unexpectedly. When a node is moved, its set of neighbors changes. As an example, Fig. 2 shows a scenario where a beacon is moved during $(t_0, t_1)$. At time $t_0$ it has three neighbors, including $S_4$ and $S_6$, but at time $t_1$ its only neighbor is $S_5$, so its neighborhood has changed noticeably. We denote the change of the neighbors of $S_i$ during $(t, t+1)$ by $IDSame(i)_{(t,t+1)}$, computed by formula (3). 

$$IDSame(i)_{(t,t+1)} = \frac{|ID(i)_t \cap ID(i)_{t+1}|}{|ID(i)_t \cup ID(i)_{t+1}|} \tag{3}$$


[Figure: node positions before and after drifting; legend entries are Sensor Location and Beacon Location] 

Fig. 2. The process of drifting nodes 


(2) Distances between neighbors 


If a node is moved only a small distance, its neighbors do not change completely, but the distances between the node and these neighbors are likely to change. As Fig. 2 shows, after a beacon is moved, a sensor may still be its neighbor while the distance between them has changed. With range-free localization algorithms, nodes can hardly obtain the distances to their neighbors, so we use $DisRank(i, j)_t$, calculated in Sect. 4.2, to reflect the distance proximity between neighbors. 


5.2 Similarity of Neighborhood Relationship 


During $(t, t+1)$, at time $t$, $S_i$ obtains $ID(i)_t$ and $DisRank(i, j)_t$, $j \in ID(i)_t$. At time $t+1$, $S_i$ obtains $ID(i)_{t+1}$ and $DisRank(i, j)_{t+1}$, $j \in ID(i)_{t+1}$. $IDSame(i)_{(t,t+1)}$ is computed by formula (3). The neighborhood relationship between $S_i$ and $S_j$ at time $t$ is defined as $Relation(i, j)_t$. $Relation(i, j)_t$ and $Relation(i, j)_{t+1}$ are computed by 

$$Relation(i,j)_t = \begin{cases} DisRank(i,j)_t \cdot IDSame(j)_{(t,t+1)}, & j \in ID(i)_t \cap ID(i)_{t+1} \\ DisRank(i,j)_t \cdot \left(1 - IDSame(i)_{(t,t+1)}\right), & j \in ID(i)_t \setminus ID(i)_{t+1} \\ 0, & j \in ID(i)_{t+1} \setminus ID(i)_t \end{cases} \tag{4}$$

$$Relation(i,j)_{t+1} = \begin{cases} DisRank(i,j)_{t+1} \cdot IDSame(j)_{(t,t+1)}, & j \in ID(i)_{t+1} \\ 0, & j \in ID(i)_t \setminus ID(i)_{t+1} \end{cases} \tag{5}$$

In formulas (4) and (5), to minimize the influence of drifting neighbors, the distance coefficient is multiplied by $IDSame(j)_{(t,t+1)}$. When $S_i$ is static but $S_j$ is moved, the change between $DisRank(i, j)_t$ and $DisRank(i, j)_{t+1}$ is caused by $S_j$. Because of the movement of $S_j$, $IDSame(j)_{(t,t+1)}$ is close to 0, which makes $Relation(i, j)_t$ and $Relation(i, j)_{t+1}$ close to 0. Besides, for $j \in ID(i)_t \setminus ID(i)_{t+1}$, $S_j$ is not a neighbor of $S_i$ at time $t+1$, so $S_i$ cannot obtain $IDSame(j)_{(t,t+1)}$ from $S_j$, and $IDSame(j)_{(t,t+1)}$ is replaced by $1 - IDSame(i)_{(t,t+1)}$. If $S_i$ is moved, $1 - IDSame(i)_{(t,t+1)}$ is close to 1; if not, it is close to 0. These operations minimize the influence of drifting neighbors. 

After $S_i$ calculates $Relation(i, j)_t$ and $Relation(i, j)_{t+1}$ for each neighbor, we define vectors $RelationV(i)_t$ and $RelationV(i)_{t+1}$ to store these data. Finally, we use the cosine similarity to calculate the similarity of $RelationV(i)_t$ and $RelationV(i)_{t+1}$ by 

$$NeiSim(i) = \frac{RelationV(i)_t \cdot RelationV(i)_{t+1}}{\|RelationV(i)_t\| \cdot \|RelationV(i)_{t+1}\|} \tag{6}$$
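The computation in formulas (3) to (6) can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the names and data layout are hypothetical, IDSame is taken to be the ratio of the sizes of the intersection and union of the two neighbor sets, and the similarity is defined as 0.0 when either vector is all zeros.

```python
import math


def id_same(before, after):
    """Formula (3): overlap of a node's neighbor sets at t and t+1."""
    union = before | after
    return len(before & after) / len(union) if union else 1.0


def relation_vectors(i, nbrs_t, nbrs_t1, rank_t, rank_t1):
    """Formulas (4) and (5): Relation(i, j) at t and t+1 for every j seen by i.

    nbrs_t / nbrs_t1 map a node id to its neighbor set at t / t+1;
    rank_t / rank_t1 map (i, j) to DisRank(i, j) at t / t+1.
    """
    def same(k):
        return id_same(nbrs_t[k], nbrs_t1[k])

    vec_t, vec_t1 = [], []
    for j in sorted(nbrs_t[i] | nbrs_t1[i]):
        at_t, at_t1 = j in nbrs_t[i], j in nbrs_t1[i]
        if at_t and at_t1:
            vec_t.append(rank_t[i, j] * same(j))
        elif at_t:
            # j is no longer a neighbor, so i falls back on its own change.
            vec_t.append(rank_t[i, j] * (1.0 - same(i)))
        else:
            vec_t.append(0.0)
        vec_t1.append(rank_t1[i, j] * same(j) if at_t1 else 0.0)
    return vec_t, vec_t1


def nei_sim(vec_t, vec_t1):
    """Formula (6): cosine similarity of the two relation vectors."""
    dot = sum(a * b for a, b in zip(vec_t, vec_t1))
    norm = math.sqrt(sum(a * a for a in vec_t)) * math.sqrt(sum(b * b for b in vec_t1))
    return dot / norm if norm else 0.0
```

A beacon whose neighborhood is unchanged obtains a similarity of about 1, while a beacon that keeps none of its old neighbors obtains 0, so comparing `nei_sim` against a threshold separates the two cases.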


6 Experiment and Analysis 


In this section, the simulation results are first presented to validate the performance and robustness of the proposed algorithm. Then, the algorithm is applied to a dynamic WSN to improve the accuracy of re-localization. 

BLVNS is based on mutual neighborhood observation, so the average connectivity degree of the network is set to more than 15. The network configuration of our simulation is as follows: 150 nodes, including 15 beacons and 135 sensors, are deployed randomly in a 150 m × 150 m region. The transmission range of each node is 30 m. We use the signal attenuation model in formula (7) to simulate the RSSI value between nodes, where $p_{ij}$ is the path loss as a function of the distance between nodes, $d_{ij}$ is the distance between sender $S_i$ and receiver $S_j$, $d_0$ is a reference distance equal to 1 m, $n_p$ is the path loss exponent, and $\varepsilon$ is an error coefficient. 

$$p_{ij} = \left(p_0 - 10\, n_p \lg\frac{d_{ij}}{d_0}\right)(1 + \varepsilon) \tag{7}$$
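For illustration, a log-distance attenuation model of this kind can be simulated as in the sketch below. It assumes the form "(reference value minus 10 n_p log10(d / d_0)) times (1 + error)", with the error drawn uniformly from a symmetric interval; the reference value of -40 dBm at 1 m and all names are assumptions of this example, not values given in the paper.

```python
import math
import random


def simulated_rssi(distance_m, n_p, p0_dbm=-40.0, d0_m=1.0, max_error=0.0, rng=random):
    """Simulated RSSI in dBm between two nodes distance_m apart.

    n_p is the path loss exponent; max_error bounds the relative error, e.g.
    0.05 draws the error coefficient uniformly from (-0.05, 0.05).
    """
    eps = rng.uniform(-max_error, max_error) if max_error else 0.0
    return (p0_dbm - 10.0 * n_p * math.log10(distance_m / d0_m)) * (1.0 + eps)
```

Passing a seeded `random.Random` instance as `rng` keeps a simulation run repeatable.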


We use the success detection rate $R_s$ and the error rate $R_e$ to evaluate the detection performance. $R_s$ and $R_e$ are calculated by (8) and (9), in which $B$ is the set of beacons, $B_u$ is the set of unreliable beacons, and $B_{du}$ is the set of unreliable beacons detected by the algorithms. 

$$R_s = \frac{Num(B_u \cap B_{du})}{Num(B_{du})} \tag{8}$$

$$R_e = \frac{Num((B - B_u) \cap B_{du})}{Num(B_{du})} \tag{9}$$

The goal of this experiment is to study how the number of drifting nodes affects the performance and to analyze the environmental adaptability of the verification algorithms. 

First, we assume that the RSSI measurement has no error, i.e., ε = 0. The thresholds of BLANS and BLATM are 0.6 and 0.45, respectively. Figure 3 shows the performance of the verification algorithms with drifting beacons. For example, when the path loss exponent is 1.5, 2, and 2.5, respectively, $R_s$ of the two algorithms stays at about 95% and $R_e$ stays under 20% as the number of drifting nodes increases, so the influence of drifting sensors is small. To model RSSI measurement errors, we set ε = 0, ε ∈ (-0.05, 0.05) and ε ∈ (-0.1, 0.1), respectively. Figure 4 shows the performance of the verification algorithms with measuring errors. Comparing Fig. 4 with Fig. 3, we find that the performances are similar, so the influence of RSSI measuring errors is small. The experiment shows that BLVNS is robust. 


[Figure: BLVNS $R_s$ and $R_e$ versus the number of drifting nodes] 

Fig. 3. Performance of BLVNS with different drifting nodes and environments 


[Figure: BLVNS $R_s$ and $R_e$ versus the number of drifting nodes, for ε = 0, ε ∈ (-0.05, 0.05) and ε ∈ (-0.1, 0.1)] 

Fig. 4. Performance of BLVNS with different drifting nodes and measuring errors 


7 Conclusion 


In this paper, we analyze the severe impact of unreliable beacons on range-free localization algorithms, which reflects the importance of beacon location verification. To eliminate the influence of these unreliable beacons on localization, we propose the BLVNS algorithm, which can efficiently recognize and filter out drifting beacons. BLVNS can minimize the re-localization error and has strong anti-interference capacity in different environments and networks. Future work will extend the location verification model to real-world experiments. 


Abstract. For planetary reducers, most components have fixed characteristics and structures. In order to obtain the 3-D models and engineering drawings of similar components that share the same characteristics and structure and differ only in size, a digital design system for planetary reducers is developed using the parametric design method. The system is planned according to the 3-D parametric design and structural analysis of planetary reducers, and a parametric operation module is also designed. In the parametric operation module, 3-D models of the main components of the planetary reducer are taken as templates; the template parameters are selected and modified by using the VB platform to call SolidWorks controls. The 3-D models and engineering drawings of similar components are then output rapidly. The system improves the efficiency and correctness of planetary reducer design and also provides a reference for designers and enterprises producing other serial products. 


Keywords: Planetary reducer - Parametric design - VB - SolidWorks 


1 Introduction 


Most components of a mechanical product have fixed characteristics and structures. By applying the parametric design method together with an established 3-D model library, designers can improve design efficiency, simplify the design work, and generate 3-D models of components that share the same characteristics and structure and differ only in size [1]. 

SolidWorks is a powerful 3-D modeling software package [2]. It is able to meet the specific needs of enterprises by integrating secondary development with parametric design. In parametric design, variables replace the fixed parameters of components in the solid model, and the parametric design of similar components is completed by modifying these variables. 
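The idea of a template whose structure is fixed and whose sizes are variables can be illustrated outside of SolidWorks with a few lines of Python. The class and field names below are hypothetical and are not part of the system described in this paper; the only engineering relation used is the standard one for spur gears, pitch diameter = module × number of teeth.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GearTemplate:
    """Hypothetical template: the structure is fixed, only the sizes vary."""
    module_mm: float
    teeth: int
    face_width_mm: float

    @property
    def pitch_diameter_mm(self):
        # Standard spur-gear relation: pitch diameter = module * number of teeth.
        return self.module_mm * self.teeth


# A similar component is obtained by modifying variables of the template.
base = GearTemplate(module_mm=2.0, teeth=20, face_width_mm=25.0)
variant = replace(base, teeth=40)
```

Here `variant` keeps the face width of `base` and differs only in the modified variable, which is the behavior a parametric model library relies on.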

Enterprises and designers pay close attention to the design, development and manufacture of planetary reducers. Especially with regard to the standardization, serialization and generalization of planetary reducer components, parametric design of reducer components has become a general trend [3]. The rapid development of computer technology and CAD technology provides the software conditions for realizing parametric design [4]. When designing planetary reducers, the mathematical model and product structure are kept the same while the size parameters are modified [5]. Therefore, using digital technology to achieve parametric design of planetary reducers helps to improve the efficiency and correctness of serial design. 


2 Realization of Digital Design System 


2.1 Secondary Development Based on ActiveX Automation in SolidWorks 


A service program (server) based on ActiveX Automation technology can be controlled and accessed by a client program (client). In the secondary development of SolidWorks with Visual Basic 6.0, the server is SolidWorks and the client is VB [6]. The client program (VB) uses its programming language to communicate directly with the server (SolidWorks) and manipulate the functionality of the software. 

In the development of the planetary reducer system, the relationship between the service and client programs is shown in Fig. 1. 


[Figure: the Parametric Design System of Planetary Reducer (client) communicates with SolidWorks 2013 (server) through ActiveX Automation] 

Fig. 1. The relationship between the service and client programs 







2.2 Planning and Realization Process of Digital Design System 


The digital system planned in this paper completes the parametric design of the main components of planetary reducers, including overall design, 3-D modeling of the main components, automatic output of engineering drawings, automatic assembly, and finite element analysis of the reducer [7]. To meet these design goals, the digital design system is planned as shown in Fig. 2. 

[Figure: modules of the Parametric Design System of Planetary Reducer, including overall design, parametric operation, automatic assembly, and general and standard parts] 

Fig. 2. Design of digital design system 








In Fig. 2, parametric operation (3-D modeling and engineering drawing), automatic assembly, and finite element analysis in the component parametric design are the key modules of the digital design system. 

The flow chart of the planetary reducer system is shown in Fig. 3. 


[Flow chart: Enter parameters of overall design -> Parametric design of components -> Update the model of components -> Update the model of assembly -> Check of motion interference -> Generate engineering drawings]


Fig. 3. Flow chart of planetary reducer digital design system

2.3 Design of Digital System


The secondary development of SolidWorks with VB is completed in this paper [8]. The
parametric design of the two-stage NGW planetary reducer is taken as an example to
complete the design of the digital design system.

Two completed parts are introduced below: the main interface of the design system and
the parametric operation module.

Figure 4 shows the main interface of the design system. Figures 5 and 6 show the
parametric operation module, one of the core modules of the design system, which covers
parametric 3-D modeling of the main reducer components and automatic output of
engineering drawings [9].

Fig. 4. Main interface of the design system

Fig. 5. Parametric design of the first stage gear

[Screenshot: Parametric Operation dialog. The "Information of Engineering Drawing" panel has the fields Material (40Cr), Name (Input Shaft), Code Name, Number, Total Number, and Note, followed by the buttons Generate 3-D Model, Generate Engineering Drawing, and Back.]


Fig. 6. Parametric design of the input shaft








Users can select the first or second transmission stage and modify the gear parameters
to generate 3-D models and engineering drawings of new gears. The parametric design of
components is thus achieved, as shown in Fig. 5.

After the gear type has been selected, the remaining components can be designed
accordingly, as shown in Fig. 6.

After the parameters are modified, the new 3-D model and the corresponding engineering
drawing are automatically saved to a new folder, so that the user can view the product
again and make further changes.


3 Parametric 3-D Modeling of Main Components 


3.1 Parametric Modeling Method 


At present, the following two methods are commonly used in parametric design [10]:


(1) The parameter modification method: a program modifies the corresponding parameters
in a 3-D model template to obtain new 3-D models.

(2) The program method: the 3-D model is built entirely by the program.


The parameter modification method is used in this paper. Figure 7 is the flow chart
of the parameter modification method.

[Flow chart (partially recovered): ... -> Call the API object and assign a new value to the variable -> Generate a new model of the component through VB program]


Fig. 7. Flow chart of the parameter modification method
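
The parameter modification method can be illustrated independently of SolidWorks. The following is a minimal Python sketch, not the VB program used in this paper: the names `ShaftTemplate` and `apply_parameters` and the dimension names are hypothetical, and the step that would assign values through the CAD API and rebuild the model is only indicated by a comment.

```python
from dataclasses import dataclass, field


@dataclass
class ShaftTemplate:
    """Stand-in for a 3-D model template: named dimensions in millimetres."""
    dimensions: dict = field(default_factory=lambda: {
        "D1@Sketch1": 40.0,    # hypothetical dimension names
        "L1@Extrude1": 120.0,
    })


def apply_parameters(template, new_values):
    """Return a new template with the given dimensions overwritten.

    Unknown names and non-positive sizes are rejected, because a template
    can only be driven through the dimensions it already defines.
    """
    updated = dict(template.dimensions)
    for name, value in new_values.items():
        if name not in updated:
            raise KeyError(f"template has no dimension named {name!r}")
        if value <= 0:
            raise ValueError(f"dimension {name!r} must be positive")
        updated[name] = float(value)
    # In a real system, the client program would assign these values
    # through the CAD API here and rebuild the model.
    return ShaftTemplate(dimensions=updated)


template = ShaftTemplate()
new_shaft = apply_parameters(template, {"D1@Sketch1": 45})
print(new_shaft.dimensions)
```

The template itself is left unchanged, so the same template can be reused for every member of a product series.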


3.2 Realization of Parametric Modeling Process Based on Characteristics 


Feature-based components that share the same characteristics can be generated from the
same “template model”. These components need not be re-modeled when a feature size
changes, which is the essence of parametric modeling [11]. When a model with a different
size is required, the “template model” can be used to generate the 3-D model quickly,
as shown in Fig. 8.


[Diagram labels: Parameter library (characteristic parameters); Template library (the template of components); Mechanism of parametric drive; Generate components; Interface of the user; Preview the components.]


Fig. 8. Parametric method of “template model and parameter driven”


With the feature-based “template model and parameter driven” mechanism, users can
modify the size parameters of the component template and rapidly obtain components that
meet the design requirements [12]. The mechanism also meets users' needs for
personalized customization of components and for the creation of non-standard
components.

In general, the common steps for the parametric establishment of a 3-D model are
shown in Fig. 9.


[Flow chart; only the labels "Start", "Modify?", "Y", and "Develop parametric program" are recoverable.]


Fig. 9. Parametric modeling process based on template model


3.3 Example of Parametric Modeling 


In this paper, 3-D models of the main components of the two-stage NGW planetary
reducer are built, including the input shaft, sun gear, planetary gear, ring gear,
planetary carrier, pin, and output shaft.

Taking the output shaft of the reducer as an example, parametric modeling is
performed according to Fig. 9.

The 3-D model template of the output shaft is shown in Fig. 10.

The parametric modeling interface of the output shaft is built with VB, which calls
SolidWorks API functions, as shown in Fig. 11.

Fig. 10. 3-D model template of the output shaft

Fig. 11. Parametric modeling of output shaft

After the parameters of the output shaft are modified, the new values are transferred,
following the modification process, to the corresponding parameter names of the output
shaft. The corresponding parameters in the template of the output shaft are then
modified, which completes the modification of size parameters and custom properties.
Finally, a new 3-D model and the corresponding engineering drawing of the output shaft
are generated, and the parametric operation of components is achieved [13].


4 Automatic Output of Engineering Drawing


4.1 Overall Framework of Automatic Adjustment System 


After a 3-D model is built, it is convenient to generate 2-D engineering drawings from
it. However, the drawing scale, the view positions, the dimensions, and the notes are
not adjusted automatically. The resulting drawings suffer from chaotic layouts and
uncoordinated proportions, so they are not suitable for guiding production directly
[14]. Therefore, the engineering drawings produced by SolidWorks must be adjusted
automatically, so that the output drawings meet enterprise standards and support rapid
production.

[Flow chart labels (partially recovered): ... and assemblies; Interface of VB; Adjust the position of annotations; Coordinates of annotation; Save the new engineering drawing.]


Fig. 12. The process for the automatic adjustment of engineering drawings

Once the new 3-D model is generated, the adjustment of the engineering drawings is
carried out with VB [14]. The process for automatic adjustment of engineering drawings
is shown in Fig. 12.


4.2 Example for Automatic Output of Engineering Drawing


The steps for the automatic output of engineering drawings are as follows: (1) establish
a template for the drawing view; (2) establish a template of engineering drawings; (3)
set custom properties for the template of engineering drawings; (4) adjust the view for
the template of engineering drawings; (5) save and manage the template of engineering
drawings.

Following these steps, the planetary carrier of the reducer is taken as an example to
complete the automatic output of the engineering drawing. Figure 13 shows the custom
properties of the planetary carrier [15].


[Screenshot: "Information of Engineering Drawing" panel with the fields Material (40Cr), Name (Planetary Carrier), Code Name, Number, Total Number, and Note.]


Fig. 13. The custom properties of the planetary carrier


Figure 14 shows the custom properties dialog box in SolidWorks after the custom
properties of the planetary carrier are assigned.

Fig. 14. The dialog box of the custom properties


Figure 15 shows the automatically generated engineering drawing of the planetary carrier.

Fig. 15. The result of the engineering drawing automatically generated


5 Conclusions


Based on an understanding of the structure and design steps of the planetary reducer,
the general framework of the digital design system for the planetary reducer is planned
using the parametric design method. The parametric operation module of the system is
developed with SolidWorks as the development platform and Visual Basic as the
development tool, and the automatic output of 3-D models and engineering drawings for
the same series of planetary reducers is realized. The system is designed to give
enterprises and designers a way to obtain serial products rapidly, in order to improve
the efficiency and quality of design and to make the products more standardized and
versatile.

Abstract. Cooperative Spectrum Sensing (CSS) is a critical component of cognitive
radio technology, which can relieve the shortage of spectrum resources. The double
threshold based CSS scheme was adopted to deal with the unreliability that arises when
the measured energy value lies close to the traditional single threshold. We use a
dynamic double threshold based energy detection scheme, which adjusts the two
thresholds in response to changes in channel conditions. The proposed scheme is
verified by formula analysis and simulation, and it achieves better performance than
the classical double threshold based CSS scheme.


Keywords: Cooperative spectrum sensing · Dynamic double threshold · Energy detection · Cognitive radio


1 Introduction 


Cognitive radio technologies adopt a strategy of sharing the frequency spectrum among
users: the secondary user (SU) opportunistically occupies the spectrum only if a
spectrum hole has been detected, which means that the primary user (PU) does not occupy
the spectrum band during a certain period of time [1]. Before the shared spectrum is
used, spectrum sensing is necessary to determine whether a PU is currently occupying
the licensed spectrum band [2].

In this paper, we adopt a dynamic double threshold method in which the thresholds are
set dynamically according to the instantaneous SNR, so as to further improve the
performance of the double threshold energy detection based CSS scheme.

The rest of the paper is organized as follows. Section 2 describes the proposed system
model. Section 3 presents the single threshold energy detection based CSS scheme.
Section 4 proposes a dynamic double threshold energy detection based CSS scheme. The
simulation results and analysis are presented in Sect. 5, and the conclusion is drawn
in Sect. 6.

2 System Model


There are three phases in the sensing system: sensing, reporting, and broadcast. The
SUs receive the signal transmitted from the PU and convert it into an energy value [3];
each SU then either makes a local decision or keeps the original energy value. The
local sensing results are reported to the fusion center in the reporting phase. This
process requires some bits of data as communication overhead [4]. The final global
decision is made in the fusion center, and all the SUs are then informed whether the PU
is occupying the licensed spectrum band.


3 Single Threshold Based Cooperative Spectrum Sensing 


3.1 Probability of Detection and False Alarm 


We use $P_d$ and $P_f$ to denote the probability of detection and the probability of
false alarm, respectively. Only the Additive White Gaussian Noise (AWGN) channel is
considered in this article, so the two probabilities are given by [5]

$$P_d = P\{E_i > \lambda \mid H_1\} = Q_M\left(\sqrt{2\gamma}, \sqrt{\lambda}\right) \quad (1)$$

$$P_f = P\{E_i > \lambda \mid H_0\} = \frac{\Gamma(M, \lambda/2)}{\Gamma(M)} \quad (2)$$

where $Q_M(\cdot,\cdot)$ is the generalized Marcum Q function [6] and
$\Gamma(\cdot,\cdot)$ is the upper incomplete gamma function. Here $E_i$ is the energy
value, $\lambda$ is the threshold, $\gamma$ is the instantaneous SNR, and $M$ is the
time bandwidth product. $H_0$ denotes that the PU is absent from the licensed spectrum
band, while $H_1$ denotes that the PU is present in the licensed spectrum band.
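
For an integer time bandwidth product, both probabilities can be evaluated with the standard library alone. The sketch below is illustrative: the function names are ours, the SNR is taken as a linear ratio rather than in dB, the upper incomplete gamma ratio uses its closed-form finite sum for integer arguments, and the Marcum Q function uses the Poisson-weighted series of the noncentral chi-square tail, truncated after a fixed number of terms (an assumption that is adequate for moderate SNR values).

```python
import math


def upper_gamma_ratio(m, x):
    """Gamma(m, x) / Gamma(m) for integer m >= 1 (finite-sum closed form)."""
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= x / j            # term is now x**j / j!
        total += term
    return math.exp(-x) * total


def marcum_q(m, a, b, terms=200):
    """Generalized Marcum Q function Q_m(a, b) for integer m >= 1."""
    mean = a * a / 2.0
    x = b * b / 2.0
    weight = math.exp(-mean)     # Poisson probability of k = 0
    total = 0.0
    for k in range(terms):
        total += weight * upper_gamma_ratio(m + k, x)
        weight *= mean / (k + 1)  # Poisson probability of k + 1
    return total


def prob_false_alarm(m, threshold):
    """False alarm probability under H0 for threshold lambda."""
    return upper_gamma_ratio(m, threshold / 2.0)


def prob_detection(m, snr, threshold):
    """Detection probability under H1; snr is linear, not in dB."""
    return marcum_q(m, math.sqrt(2.0 * snr), math.sqrt(threshold))


print(prob_false_alarm(5, 12.0), prob_detection(5, 1.0, 12.0))
```

Raising the threshold lowers both probabilities, which is the trade-off that the double threshold schemes try to soften.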


3.2 Fusion Decision 


The K out of N rule belongs to the hard fusion decision schemes and is also called the
voting rule or majority rule [7]. The fusion center processes the local sensing results
reported by each SU. Assume that the number of SUs performing spectrum sensing is N.
The fusion center makes the final decision that the PU is present in the licensed
spectrum band only if at least K SUs indicate that the band is occupied by the PU. The
final global probabilities of detection and false alarm can be formulated as follows:

$$Q_d = \sum_{i=K}^{N} \binom{N}{i} P_d^{\,i} (1 - P_d)^{N-i} \quad (3)$$

$$Q_f = \sum_{i=K}^{N} \binom{N}{i} P_f^{\,i} (1 - P_f)^{N-i} \quad (4)$$
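
The K out of N rule is a binomial tail sum, which can be sketched in a few lines. The function name `k_out_of_n` is ours, and the sketch assumes, as the formulas do, that every SU has the same local probability and that the SUs decide independently.

```python
from math import comb


def k_out_of_n(p, n, k):
    """Probability that at least k of n independent SUs report 'occupied',
    when each SU does so with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# K = 1 behaves like an OR rule, K = N like an AND rule.
print(k_out_of_n(0.9, 10, 1), k_out_of_n(0.9, 10, 10))
```

Passing the local detection probability gives the global detection probability, and passing the local false alarm probability gives the global false alarm probability.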

The Maximal Ratio Combining (MRC) [8] method forms a weighted sum of the energy values,
in which the normalized weight of each SU is set according to its instantaneous SNR.
The method can be formulated as follows:

$$P_{d,MRC} = P\{E_{MRC} > \lambda_{MRC} \mid H_1\} = Q_M\left(\sqrt{2\gamma_{MRC}}, \sqrt{\lambda_{MRC}}\right) \quad (5)$$

$$P_{f,MRC} = P\{E_{MRC} > \lambda_{MRC} \mid H_0\} = \frac{\Gamma(M, \lambda_{MRC}/2)}{\Gamma(M)} \quad (6)$$

where $\gamma_{MRC} = \sum_{i=1}^{N} \gamma_i$, $E_{MRC} = \sum_{i=1}^{N} w_i E_i$, and $w_i = \gamma_i / \gamma_{MRC}$.
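

4 Dynamic Double Threshold Based Cooperative Spectrum Sensing


For a given threshold $\lambda$, the dynamic thresholds are formulated as follows.

$$\lambda_{0i} = \omega_i \times \lambda \quad (7)$$

$$\lambda_{1i} = (2 - \omega_i) \times \lambda \quad (8)$$

$$\omega_i = \frac{\gamma_i}{\max(\gamma_1, \gamma_2, \ldots, \gamma_N)} \quad (9)$$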
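
The threshold adjustment can be sketched as follows, reading the formulas above as: each SU scales the given threshold by the ratio of its own SNR to the largest SNR to obtain the lower threshold, and mirrors that value about the given threshold to obtain the upper one. The function name `dynamic_thresholds` is ours, and the SNR values are assumed to be positive linear ratios.

```python
def dynamic_thresholds(snrs, base_threshold):
    """Return one (lower, upper) threshold pair per SU."""
    strongest = max(snrs)
    pairs = []
    for snr in snrs:
        omega = snr / strongest               # scaling factor in (0, 1]
        pairs.append((omega * base_threshold,
                      (2 - omega) * base_threshold))
    return pairs


print(dynamic_thresholds([0.5, 1.0, 2.0], 10.0))
```

Under this reading, the SU with the largest SNR has a scaling factor of 1, so its two thresholds coincide with the given threshold, while SUs with weaker channels get a wider uncertain region.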


The fusion center makes the global decision $GD_1$ based on the local binary decisions
and the global decision $GD_2$ based on the original energy values. The final decision
$D$ is then expressed as follows.

$$D = \begin{cases} 0, & GD_1 + GD_2 < 1 \\ 1, & \text{otherwise} \end{cases} \quad (10)$$

$$GD_1 = \begin{cases} 0, & L_1 < K \\ 1, & \text{otherwise} \end{cases} \quad (11)$$

$$GD_2 = \begin{cases} 0, & E_{MRC} = \sum_{j=1}^{G} w_j E_j < \lambda_{MRC} \\ 1, & \text{otherwise} \end{cases} \quad (12)$$

where $L_1$ is the number of reported binary decisions that indicate the presence of
the PU.
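
The two-branch fusion can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure of the paper: an SU whose energy exceeds its upper threshold votes 1, an SU below its lower threshold votes 0, an SU in between reports its raw energy, and the MRC weights are normalized over only the SUs that reported raw energy. The function name `fuse` and its parameters are ours.

```python
def fuse(energies, snrs, thresholds, k, mrc_threshold):
    """Return the final decision (1 = PU present).

    energies, snrs: per-SU energy values and linear SNRs
    thresholds: per-SU (lower, upper) threshold pairs
    """
    votes = 0        # SUs whose local binary decision is 'present'
    pending = []     # (snr, energy) of SUs between their two thresholds
    for energy, snr, (lower, upper) in zip(energies, snrs, thresholds):
        if energy > upper:
            votes += 1
        elif energy >= lower:
            pending.append((snr, energy))
        # energy < lower: local decision 0, nothing to count

    gd1 = 1 if votes >= k else 0              # K out of N on binary votes

    gd2 = 0
    if pending:
        snr_sum = sum(snr for snr, _ in pending)
        e_mrc = sum(snr / snr_sum * energy for snr, energy in pending)
        gd2 = 1 if e_mrc >= mrc_threshold else 0

    return 1 if gd1 + gd2 >= 1 else 0         # either branch suffices


pairs = [(8.0, 12.0)] * 3
print(fuse([11.0, 5.0, 5.0], [1.0, 1.0, 1.0], pairs, k=2, mrc_threshold=10.0))
```

In the usage line no SU exceeds its upper threshold, so the decision is carried entirely by the energy value that falls between the thresholds.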

The final decision is made in the fusion center. The final probabilities of detection
and false alarm are formulated as follows.

$$P_D = 1 - (1 - Q_d)\left(1 - P_{d,MRC} \prod_{i=1}^{G} \Delta_{1,i}\right) \quad (13)$$

$$P_F = 1 - (1 - Q_f)\left(1 - P_{f,MRC} \prod_{i=1}^{G} \Delta_{0,i}\right) \quad (14)$$

where $Q_d$ and $Q_f$ are computed using the K out of N rule, $P_{d,MRC}$ and
$P_{f,MRC}$ are computed using MRC, and $G$ is the number of original energy values
reported by the SUs. The probability that the energy lies between the double thresholds
must be taken into account; it is expressed as follows.

$$\Delta_{0,i} = \frac{\Gamma(M, \lambda_{0i}/2)}{\Gamma(M)} - \frac{\Gamma(M, \lambda_{1i}/2)}{\Gamma(M)} \quad (15)$$

$$\Delta_{1,i} = Q_M\left(\sqrt{2\gamma_i}, \sqrt{\lambda_{0i}}\right) - Q_M\left(\sqrt{2\gamma_i}, \sqrt{\lambda_{1i}}\right) \quad (16)$$


5 Simulation Results and Analysis 


The simulation is carried out under the assumption of an AWGN channel. The time
bandwidth product is $M = 5$. We compare the dynamic double threshold based CSS scheme
with the fixed double threshold scheme, in which the thresholds are set to
$\lambda_0 = 0.8\lambda$ and $\lambda_1 = 1.2\lambda$.

Fig. 1. Detection probability curve for single threshold based CSS using K out of N

Figure 1 shows the probability of detection versus the probability of false alarm for
the single threshold based CSS scheme, in which the final probability is computed using
the K out of N rule and the SNR is randomly generated between -15 dB and 10 dB. The
more SUs take part in spectrum sensing, the higher the detection probability.
Meanwhile, as expected, the smaller K is, the higher the probability.

Fig. 2. Detection probability curve for single threshold based CSS using MRC

Figure 2 shows how the detection probability changes with the false alarm probability
when the MRC method is used to make the final decision. Evidently, achieving a higher
detection probability requires as many original energy values as possible.

Fig. 3. Detection probability curve for dynamic double threshold based CSS

Figure 3 shows the proposed dynamic double threshold based CSS scheme under the
condition (N = 10, K = 10), and compares it with the scheme that ignores the energy
values between the double thresholds and with the scheme that further processes the
original energy reported by the SUs. The proposed scheme performs better because it
makes use of the energy values between the double thresholds and adjusts the thresholds
appropriately according to the instantaneous SNR.


6 Conclusion 


The proposed dynamic double threshold based CSS scheme achieves better performance by
using MRC to process the original energy values between the double thresholds and the
K out of N rule to process the local binary decisions. The critical factor is the
dynamic threshold, which is adjusted according to the instantaneous SNR.

Abstract. Among the diverse distributed query and analysis engines, Kylin has gained
wide adoption because of its various strengths. By using Kylin, users can interact with
Hadoop data at sub-second latency. However, it still has some disadvantages. One
representative disadvantage is the exponential growth of cuboids along with the growth
of dimensions. In this paper, we optimize the cuboid materialization strategy of Kylin
by reducing the number of cuboids, based on traditional OLAP optimization methods. We
optimize the strategy mainly in two respects. Firstly, we propose a Lazy-Building
strategy to delay the construction of nonessential cuboids and shorten the time of
cuboid initialization. Secondly, we adopt a Materialized View Self-adjusting Algorithm
to eliminate the cuboids that have not been used for a long period. Experimental
results demonstrate the efficacy of the proposed Distributed Self-Adaption Cube
Building Model. Specifically, compared with the cube building model of Kylin, our model
increases cube initialization speed by 28.5% and saves 65.8% of the space.


Keywords: Distributed OLAP · Distributed query processing system · Kylin · Query log · Materialization strategy


1 Introduction 


In the era of big data, many modern companies produce huge amounts of data in their
service lines. These data are used for report analysis based on OLAP. To conduct report
analysis, companies need a system that can respond to the queries of thousands of data
analysts at the same time, which requires high scalability, stability, accuracy, and
speed. In fact, there is no widely accepted method in the distributed OLAP field. Many
query engines, such as Presto [4], Impala [2], Spark SQL [14], and Elasticsearch [10],
can also conduct report analysis, but they place more emphasis on general data query
and analysis. Kylin [7] is a specialized tool that is often used in the distributed
OLAP field.

Kylin was originally developed by eBay and is now a project of the Apache Software
Foundation. It is designed to accelerate analysis on Hadoop and to allow the use of
SQL-compatible tools. It also provides a SQL interface and supports multidimensional
analysis on Hadoop for extremely large datasets. Kylin can reach the scale of one
million or even millisecond OLAP analysis, so it is used very frequently in the
domestic IT industry.

The idea of Kylin is not original. Many of the technologies in Kylin have been used to
accelerate analysis over the past 30 years. These technologies involve storing
pre-calculated results, generating each level's cuboids with all possible combinations
of dimensions, and calculating all metrics at different levels. Essentially, Kylin
extends the methods of the traditional OLAP field to the distributed field, generating
the cube on the Hadoop ecosystem.

When the data becomes bigger, the pre-calculation becomes impossible on a single
machine, even with powerful hardware. However, with the benefit of Hadoop's distributed
computing power, calculation jobs can leverage hundreds of thousands of nodes [9]. This
allows Kylin to perform these calculations in parallel and merge the final result,
thereby significantly reducing the processing time.

Data cube [5] construction is the core of Kylin. It has two characteristics: one is the
exponential growth of cuboids [5] along with the growth of dimensions; the other is the
large amount of IO caused by the increased number of cuboids. The cube is usually very
sparse, and the increase of sparse data wastes a lot of computing time and memory space.

A full n-dimensional data cube contains $2^n$ cuboids [5]. However, most of the cuboids
are never used, because most of the queries requested by data analysts follow the
normal distribution. Building them is a waste of IO and memory.
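
The exponential growth is easy to see by enumerating the cuboids directly: each cuboid corresponds to one subset of the dimensions. The short sketch below is only an illustration, and the function name `all_cuboids` is ours.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every subset of the dimensions, from the 0-dimension cuboid ()
    up to the cuboid that keeps all dimensions."""
    return [combo
            for size in range(len(dimensions) + 1)
            for combo in combinations(dimensions, size)]


cuboids = all_cuboids(["A", "B", "C", "D"])
print(len(cuboids))  # prints 16
```

With 4 dimensions there are 16 cuboids; with 10 dimensions there are already 1024, and every additional dimension doubles the count.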

In this paper, we propose a self-adaption cube building model that builds cuboids
lazily and abandons useless cuboids based on the query log. It substantially reduces
the cube construction time and the cube size, which saves IO and memory. The paper is
structured as follows. In Sect. 2, we present the background. In Sect. 3, we introduce
the design and implementation details of the self-adaption cube building model. In
Sect. 4, we focus on experimental evaluation. Finally, in Sect. 5, we discuss the
Self-Adaption Cube Building Model and give a summary of the paper.


2 Background 


2.1 Cube Calculation Algorithm 


There are several strategies of data cube materialization [11] that reduce the cost of
aggregation calculation and increase query processing efficiency, including the iceberg
cube calculation algorithm [3], the condensed cube calculation algorithm [15], the
shell fragment cube calculation algorithm [13], the approximate cube calculation
algorithm [17], and the time-series data stream cube calculation algorithm [6]. They
are all based on Partial Materialization [16], which means that a data sub-cube is
selected and pre-calculated according to specific methods. Partial Materialization is a
compromise between storage space, cost of maintenance, and query processing efficiency.

In the process of iceberg cube calculation, sub-cubes that are higher than the minimum
threshold are aggregated and materialized. Beyer proposed the BUC algorithm [12] for
iceberg cube calculation, which is widely accepted.

According to the order of cuboid calculation, the methodologies of aggregation
calculation can be divided into two categories: top-down and bottom-up.

1. Top-Down: Firstly, calculate the metric of the whole data cube, and then perform a
recursive search along each dimension. Secondly, check the iceberg conditions and prune
the branches that do not meet them. The most typical algorithm is the BUC algorithm,
which performs best on sparse data cubes.

2. Bottom-Up: Starting from the base cuboids, compute each high-level cuboid from a
low-level cuboid in the search grid according to the parent-child relationship. Typical
algorithms are the Pipesort algorithm, the Pipehash algorithm, the Overlap algorithm,
and the Multiway aggregation algorithm [18].


However, Kylin does not follow the principle of partial materialization. In order to
reduce unnecessary redundant calculation and shorten the cube construction time, Kylin
adopts a method called By Layer Cubing, which is a distributed version of the Pipesort
algorithm, a kind of bottom-up algorithm [1].


2.2 By Layer Cubing 


As its name indicates, a full cube is calculated layer by layer: N dimensions, N-1
dimensions, N-2 dimensions, and so on down to 0 dimensions. Each layer's calculation is
based on its parent layer (except the first, which is based on the source data), so
this algorithm needs N rounds of MapReduce running in sequence [8]. In each MapReduce
job, the key is the composite of the dimensions and the value is the composite of the
measures. When the mapper reads a key-value pair, it calculates its possible child
cuboids; for each child cuboid, it removes one dimension from the key and outputs the
new key and value to the reducer. The reducer receives the values grouped by key,
aggregates the measures, and writes the output to HDFS, which completes one layer's MR
job. When all layers are finished, the cube is calculated. Fig. 1 describes the
workflow:


[Diagram: successive cuboid layers (for example, the 2-D cuboids), each computed by an MR job.]


Fig. 1. By layer cubing. Each level of computation is a MapReduce task, and the tasks
are executed serially. An N-dimensional cube needs at least N MapReduce jobs.
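
The layer-by-layer computation can be sketched in memory as follows. This is a simplified single-process illustration, not Kylin's implementation: the function name `build_by_layer` is ours, the measure is aggregated with SUM, and each pass of the `while` loop stands in for one MapReduce round in which every parent cuboid emits its children by dropping one dimension from the key.

```python
from collections import defaultdict


def build_by_layer(dimensions, facts):
    """facts: list of (record dict, measure). Returns {cuboid: {key: sum}}."""
    base = tuple(dimensions)
    cube = {base: defaultdict(int)}
    for record, measure in facts:
        cube[base][tuple(record[d] for d in base)] += measure

    layer = [base]
    while layer and layer[0]:              # stop after the 0-dimension cuboid
        next_layer = []
        for parent in layer:
            for drop in range(len(parent)):
                child = parent[:drop] + parent[drop + 1:]
                if child in cube:          # already built from another parent
                    continue
                rows = defaultdict(int)
                for key, measure in cube[parent].items():
                    rows[key[:drop] + key[drop + 1:]] += measure
                cube[child] = rows
                next_layer.append(child)
        layer = next_layer
    return cube


facts = [({"A": 1, "B": "x"}, 10),
         ({"A": 1, "B": "y"}, 5),
         ({"A": 2, "B": "x"}, 7)]
cube = build_by_layer(["A", "B"], facts)
print(sorted(cube))
```

Every layer has to be complete before the next one starts, which is why the rounds in the distributed version run in sequence.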

By Layer Cubing has some disadvantages:


1. The algorithm causes too much shuffling in Hadoop. The mapper does no aggregation,
so all the records that have the same dimension values in the next layer are emitted to
Hadoop and only then aggregated by the combiner and reducer.

2. There are many reads and writes on HDFS: each layer's cubing must write its output
to HDFS for the next layer's MR job to consume. In the end, Kylin needs another round
of MR to convert these output files to HBase HFiles for bulk load. These jobs generate
many intermediate files in HDFS.


All in all, the performance is not good, especially when the cube has many dimensions.


2.3 By Segment Cubing 


To overcome these shortcomings, Kylin developed a new cube building algorithm called By
Segment Cubing. The core idea is that each mapper calculates the data block fed to it
into a small cube segment (with all cuboids) and then outputs all key/value pairs to
the reducer; the reducer aggregates them into one big cube segment, which finishes the
cubing. Fig. 2 illustrates the flow.


[Diagram: each data split is processed by a mapper; the mapper outputs pass through merge sort (shuffle) and are combined into the final cube.]


Fig. 2. By segment cubing.


Compared with By Layer Cubing, By Segment Cubing has two main differences:


1. The mapper does pre-aggregation, which reduces the number of records that the mapper
outputs to Hadoop and also the number of records that the reducer needs to aggregate.
2. One round of MR can calculate all cuboids.


Based on the work mentioned above, we take advantage of both algorithms and optimize
the cuboid materialization strategy.
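
The contrast with the by-layer approach can be sketched in a few lines. This is again a simplified in-memory illustration with names of our own choosing (`map_split`, `reduce_segments`) and SUM as the aggregation: each mapper pre-aggregates its split into a small segment containing every cuboid, and a single reduce step merges the segments.

```python
from collections import Counter
from itertools import combinations


def map_split(dimensions, records):
    """Mapper: pre-aggregate one data split into a small cube segment."""
    segment = Counter()
    for record, measure in records:
        for size in range(len(dimensions) + 1):
            for cuboid in combinations(dimensions, size):
                key = (cuboid, tuple(record[d] for d in cuboid))
                segment[key] += measure
    return segment


def reduce_segments(segments):
    """Reducer: merge the per-split segments into the final cube."""
    final = Counter()
    for segment in segments:
        final.update(segment)      # adds the measures key by key
    return final


dims = ["A", "B"]
split_1 = [({"A": 1, "B": "x"}, 10), ({"A": 1, "B": "y"}, 5)]
split_2 = [({"A": 2, "B": "x"}, 7)]
final = reduce_segments([map_split(dims, split_1), map_split(dims, split_2)])
print(final[(("A",), (1,))])
```

Because rows with equal keys are merged inside the mapper, fewer records cross the shuffle, and only one round is needed.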

3 Design and Implementation


In this section, we first introduce the architecture of the Self-adaption Cube Building
Model (SCBM) and the overall workflow. Then we explain cuboid Lazy-Building and the
cuboid spanning tree. Finally, we describe the implementation details of the
Materialized View Self-Adjusting Algorithm.


3.1 Architecture of Self-adaption Cube Building Model 


The overall architecture of the self-adaption cube building model is illustrated in
Fig. 3.

[Diagram labels (partially recovered): Import data source; Fact table update; ... building module; Data cube; Read cube; Trigger update condition; ... module.]


Fig. 3. Architecture of self-adaption cube building model.


The self-adaption cube building model takes the fact table [5] as the input of the
overall system; the fact table is usually managed by the distributed data warehouse
Hive. We first set the parameters of the cube model, such as the fields for analysis
and the base cuboid level, and then build the base cuboids in a MapReduce job. After
the base cuboids are constructed, the system can serve query requests. The query
execution engine [7] resolves each query to find the required cuboids. If the cuboid
has been generated, the query is executed; if the cuboid is missing, the Lazy building
module is triggered to build the cuboid using the method in Sect. 3.2. When the query
result returns, the system records the query log and waits for the Self-Adaption module
to adjust the cube according to the Materialized View Self-Adjusting Algorithm
explained in Sect. 3.3. At the same time, the system maintains a dynamic cube spanning
tree to store the metadata of the cuboids.

3.2 Cuboid Spanning Tree and Lazy-Building


Cuboid Spanning Tree. In the original By Layer Cubing, Kylin calculates the cuboids in
Breadth First Search (BFS) order, which wastes memory. In contrast, the Cuboid Spanning
Tree generates cuboids in Depth First Search (DFS) order to reduce the number of
cuboids that need to be cached in memory. This avoids unnecessary disk and network I/O
and greatly reduces the resources that Kylin occupies.

With the DFS order, the output of a mapper is fully sorted (except in some special
cases), because the row key of a cuboid is composed of the cuboid ID and the dimension
values, as in [Cuboid ID + dimension values], and the rows inside a cuboid are already
sorted. Since the mapper outputs are already sorted, the shuffle sort is more efficient.

In addition, DFS order is very suitable for the lazy building of cuboids. The cuboid
spanning tree also records the metadata of the cuboids in every node of the tree, which
provides the basis for the selection of ancestor cuboids.


Lazy-Building. Lazy-Building is a basic concept of the model. To reduce the number of
cuboids, we adopt the strategy of generating cuboids on demand. At the same time, we
persist all cuboids on the low layer in the By-Layer cubing algorithm, which gives
Lazy-Building higher speed and lower computational complexity. For example, suppose a
cube has 4 dimensions, A, B, C, and D; each mapper has 1 million source records to
process; and the column cardinalities in the mapper are Card(A), Card(B), Card(C), and
Card(D). Lazy-Building is demonstrated in Fig. 4.


[Diagram: cuboid lattice over the dimensions A, B, C, and D, from the 1-dimension cuboids to the cuboid [A, B, C, D]. A query that groups by C hits the cuboid [C].]


Fig. 4. Lazy-building and ancestor cuboids selection.

1. The user sets a base-layer parameter in the cube model information to control the
scale of the base cuboid layer. If this parameter is not set, the default value
log(dimensions) + 1 is used.

2. The base cuboid building module imports data from the fact table and builds the base
cuboids with the Cube Build Engine in Kylin.

3. The Cuboid Spanning Tree is updated and the metadata is saved.

4. A client launches the query "select avg(measure 1) from table group by C", which
hits the missing cuboid [C]. The lazy building module then receives a request to build
cuboid [C].

5. The lazy building module finds a cuboid generation path according to Ancestor
Cuboids Selection and builds the missing cuboid in order to respond to the query as
soon as possible.

6. The path is recorded, and the Materialized View Self-Adjustment Algorithm determines
whether to build all the cuboids on the path when the load is low.
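
The on-demand behaviour in steps 4 and 5 can be sketched with a small in-memory class. This is a simplification with hypothetical names (`LazyCube`, `query`, `built_on_demand`): only the cuboid that keeps all dimensions is prebuilt, rather than a configurable base layer, and the measure is aggregated with SUM instead of the average used in the example query.

```python
from collections import defaultdict


class LazyCube:
    """Keeps only the cuboids that were prebuilt or requested by a query."""

    def __init__(self, dimensions, facts):
        self.dimensions = tuple(dimensions)
        base = defaultdict(int)
        for record, measure in facts:
            base[tuple(record[d] for d in self.dimensions)] += measure
        self.cuboids = {self.dimensions: dict(base)}
        self.built_on_demand = []

    def query(self, group_by):
        """Return the cuboid for the group-by set, building it if missing."""
        target = tuple(d for d in self.dimensions if d in group_by)
        if target not in self.cuboids:
            source = self._smallest_ancestor(target)
            self.cuboids[target] = self._roll_up(source, target)
            self.built_on_demand.append(target)
        return self.cuboids[target]

    def _smallest_ancestor(self, target):
        candidates = [c for c in self.cuboids if set(target) <= set(c)]
        return min(candidates, key=lambda c: len(self.cuboids[c]))

    def _roll_up(self, source, target):
        keep = [source.index(d) for d in target]
        rows = defaultdict(int)
        for key, measure in self.cuboids[source].items():
            rows[tuple(key[i] for i in keep)] += measure
        return dict(rows)


facts = [({"A": 1, "B": "x", "C": "u"}, 10),
         ({"A": 2, "B": "x", "C": "u"}, 5),
         ({"A": 1, "B": "y", "C": "v"}, 7)]
lazy = LazyCube(["A", "B", "C"], facts)
print(lazy.query(["C"]), lazy.built_on_demand)
```

A second query on the same group-by set is answered from the stored cuboid without any rebuilding.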


Ancestor Cuboids Selection. When a needed cuboid is missing, we select an ancestor
cuboid and a cuboid generation path. The basic principle is to choose the ancestor
cuboid that requires the least aggregation of measures, which minimizes the computation
and time needed to generate the missing cuboid. After that, we find a path P from the
ancestor cuboid to the missing cuboid in compliance with the minimum cardinality
principle.

Consider the example in Fig. 4. To generate the missing cuboid [C], we first find all
the candidate cuboids [A B C], [A C D] and [B C D], and then compare their sizes.
Assuming [B C D] is selected, we generate [C] by aggregating [B C D] over dimensions B
and D. This cuboid is enough to answer the query. However, to maintain the cube
according to By Layer Cubing, we also need to find a path from [B C D] to [C].

When aggregating from the base cuboid [B C D] down to the 1-dimension cuboid [C], there
are two paths: [B C D] → [B C] → [C] and [B C D] → [C D] → [C]. We assume
Card(D) > Card(B) and that the dimensions are independent of one another. After the
first aggregation step, the resulting cuboid is about 1/Card(D) (for [B C]) or
1/Card(B) (for [C D]) of the size of the base cuboid, so the output of this step is
reduced to 1/Card(D) or 1/Card(B) of the original. We therefore choose the first path,
so that the records written from mapper to reducer are reduced to 1/Card(D) of the
original size. Less output to Hadoop means less I/O and computation, and the model
attains better performance.
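The selection rule and the minimum cardinality principle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in Kylin or in this paper: the function names, the row counts and the cardinalities are hypothetical, and cuboid size is approximated by a row count.

```python
from typing import Dict, FrozenSet, List

Cuboid = FrozenSet[str]


def select_ancestor(target: Cuboid, sizes: Dict[Cuboid, int]) -> Cuboid:
    """Pick the smallest materialized cuboid that contains every target dimension."""
    candidates = [c for c in sizes if target < c]
    if not candidates:
        raise LookupError("no materialized ancestor for the requested cuboid")
    # Ties are broken by dimension names so the result is deterministic.
    return min(candidates, key=lambda c: (sizes[c], sorted(c)))


def generation_path(ancestor: Cuboid, target: Cuboid,
                    cardinality: Dict[str, int]) -> List[Cuboid]:
    """Drop the extra dimensions one by one, highest cardinality first."""
    extra = sorted(ancestor - target, key=lambda d: (-cardinality[d], d))
    path, current = [ancestor], ancestor
    for dim in extra:
        current = current - {dim}
        path.append(current)
    return path


# Hypothetical row counts and cardinalities, chosen so that Card(D) > Card(B).
sizes = {
    frozenset("ABC"): 900,
    frozenset("ACD"): 1200,
    frozenset("BCD"): 600,
}
cardinality = {"A": 20, "B": 10, "C": 5, "D": 50}

ancestor = select_ancestor(frozenset("C"), sizes)
path = generation_path(ancestor, frozenset("C"), cardinality)
print([sorted(c) for c in path])
```

With these assumed numbers the sketch selects [B C D] and aggregates away D before B, which corresponds to the first of the two paths discussed above.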


3.3 Materialized View Self-Adjustment Algorithm 


The self-adaption module adjusts the cube according to the Materialized View
Self-Adjustment Algorithm. This section proposes a query statistics method that takes a
fixed number of queries as a statistical period and updates the corresponding query
statistics once per period. The algorithm adjusts the materialized view set according
to the thresholds of elimination and generation, which stabilizes query efficiency and
minimizes the shake (oscillation) of the materialized view set.

Query Statistics Method. The query statistics method is defined as follows.


Definition 1: Materialized view adjustment cycle $T_n$.
The materialized view adjustment cycle can be set to a fixed number of queries, for
example one adjustment cycle per 100 queries.


Definition 2: Average query statistics $E(T_n(q_i))$.

Since the actual queries may change over time, the query set should be adjusted
accordingly. For example, a query that has not been executed for several cycles should
be removed from the query set and the corresponding materialized view deleted.

After many queries, the query log accumulates a certain number of query records. This
paper presents a statistical method based on the query log, described as follows.
Given a query set $Q = \{q_1, q_2, \ldots, q_k\}$ and a query log set $L$, scan the log
file backward from its end, determine whether query $q_i$ occurs in cycle $T_n$, and
update $E(T_n(q_i))$ according to Eq. 1:


$E(T_n(q_i)) = \ldots$ [right-hand side illegible in the source] (1)


In the formula, a is a constant weighting coefficient and $L(T_n)$ is the query set in
cycle $T_n$. With this method we can monitor changes in the query set Q, which greatly
reduces the shake of the materialized view set.


Materialized View Self-Adjustment Algorithm. The main steps for adjusting the
materialized view set as the queries change are as follows:


1. Prior to the adjustment, initialize the materialized view set M = {base cuboids}
and the corresponding query task set Q.

2. During querying, each query is written to the query log L and the query counter is
incremented.

3. Set the threshold of elimination T and the threshold of generation S, update the
average query statistics in each cycle $T_n$, and determine whether to eliminate or
materialize the corresponding views.


The pseudo code of the Materialized View Self-Adjustment Algorithm is shown in
Algorithm 1.

Input: query log L, materialized view set M, query task set Q, materialized view
adjustment cycle T_n, threshold of elimination T, threshold of generation S, path set
P from ancestor cuboid to the missing cuboid

Output: materialized view set M after adjustment

 1  get the current query count value count;
 2  if count % T_n = 0 then
 3      for j = 1; j ≤ T_n; j++ do
 4          scan the log file L(T_n) backward from its end;
 5          update query task set Q according to L(T_n) and P(T_n);
 6      end
 7  end
 8  update E(T_n(q_i)) according to formula 1;
 9  for each q_i in Q do
10      if E(T_n(q_i)) ≥ S then
11          materialize views m corresponding to q_i;
12          M.add(m);
13      end
14      else if E(T_n(q_i)) < T then
15          eliminate views m corresponding to q_i;
16          M.delete(m);
17      end
18  end
19  return M;

Algorithm 1. Materialized View Self-Adjustment Algorithm


In the above algorithm, lines 1 to 8 scan the query log of statistical period $T_n$
and update the query task set Q during the scan. Lines 9 to 18 iterate over the
queries in Q and determine whether to eliminate or materialize the corresponding views
by comparing the thresholds with the value of $E(T_n(q_i))$ calculated by Eq. 1.
Suppose the query task set Q contains k different queries; then the time complexity of
the algorithm is $O(T_n + k)$.
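As a rough illustration of one adjustment cycle, the sketch below keeps a statistic per query and applies the two thresholds. Several things are assumptions made only for this example: the update rule (decay the previous value by the coefficient a and add 1 when the query occurred in the cycle) is a stand-in and not the paper's Eq. 1, every query is identified with the single cuboid it needs, base cuboids are never eliminated, and all names and threshold values are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def adjust_views(cycle_log: Iterable[str], stats: Dict[str, float],
                 views: Set[str], base_views: Set[str], a: float,
                 generate_at: float, eliminate_below: float) -> Set[str]:
    """Run one adjustment cycle and return the updated materialized view set."""
    seen = set(cycle_log)
    # Assumed stand-in for Eq. 1: decay the old statistic by a and add 1
    # when the query occurred in this cycle.
    for query in set(stats) | seen:
        stats[query] = a * stats.get(query, 0.0) + (1.0 if query in seen else 0.0)

    adjusted = set(views)
    for query, score in stats.items():
        if score >= generate_at:
            adjusted.add(query)        # materialize the view for this query
        elif score < eliminate_below and query not in base_views:
            adjusted.discard(query)    # eliminate the view for this query
    return adjusted


# Hypothetical workload: cuboid [C] is queried for three cycles, then not at all.
stats: Dict[str, float] = {}
base = {"ABC", "ACD", "BCD"}
views = set(base)
history: List[bool] = []
for cycle in (["C", "C", "AB"], ["C"], ["C"], [], [], []):
    views = adjust_views(cycle, stats, views, base,
                         a=0.5, generate_at=1.5, eliminate_below=0.3)
    history.append("C" in views)
print(history)
```

Keeping the generation threshold well above the elimination threshold is what damps the shake: a view is neither created nor dropped because of a single busy or idle cycle.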


4 Experimental Evaluation 


4.1 Dataset 


To test performance, we use the standard weather dataset from the China Meteorological
Data network. The dataset contains 4,726,499 weather records from 2,170 distinct
counties in China, covering January 1, 2011 to January 1, 2017. The original dataset is
too complicated to use directly, so to simplify the experiment we select eight
dimensions (province, city, county, date, weather, wind direction, wind speed and air
quality level) and two measures (maximum temperature and minimum temperature).


4.2 Evaluation Metrics


We use cube first construction time, average query time and cube size as the evaluation
metrics of our proposed method.

Cube First Construction Time refers to the time taken to build the base cuboids in the
self-adaption cube building model.

Average Query Time is defined as the average query time during the materialized view
adjustment cycles $T_1$ to $T_n$. The detailed calculation is given in Eq. 2.


$$\text{Average Query Time} = \frac{1}{\mathrm{NUM}(Q)} \sum_{i=1}^{n} \text{response time}(q_i) \qquad (2)$$


Cube Size refers to the disk space that the whole cube takes up.


4.3 Experimental Results 


We first compare the cube first construction time. Because the base-layer parameter
has a great impact on this metric, we use the default value log(dimensions) + 1 to
reflect the average condition. We test each model 5 times and average the results,
which avoids the impact of MapReduce failures. As Table 1 shows, the time consumption
of the new model is reduced by 28.5% (Fig. 5).


Table 1. Cube first construction time


Test result                          Average
building model
Self-adaption cube building model    65.2 min

[Fig. 5: average query response time (s).]

For query time, we set the materialized view adjustment cycle $T_n$ to 50 queries and
test 30 cycles, $T_1$ to $T_{30}$. We observe that the cuboid hit rate increases and
the query response time improves significantly as the number of query requests grows.
Eventually, the query efficiency of the two models is almost on a par.

For cube size, Fig. 6 shows that the curve for this metric tends to stabilize after
fluctuating in the early stage. In the end, the space consumption of the proposed model
is reduced by 65.83%.

Fig. 6. Average cube size trends in $T_1$ to $T_{30}$.


5 Conclusion 


We have presented a Distributed Self-Adaption Cube Construction Model Based on Query
Log and applied it to a weather dataset to test its performance. Our model adopts a
special partial materialization strategy and automatically adjusts the cuboid set used
to answer query requests according to the query log. The experimental results show that
the proposed model reduces the cube first construction time by 28.5% and the cube size
by 65.83%, at the expense of a small reduction in query efficiency. However, the model
performs well only when the query distribution is relatively concentrated, so users can
choose either of the two models according to their practical business query scenario.
Overall, the proposed model is of great practical significance in the application of BI
tools. In the next stage, we will optimize the base cuboid generation strategy to
reduce the query latency in the early stage.


Abstract. As the number of IoT devices grows exponentially and IoT networks expand in
size and complexity, timely device discovery is becoming a pressing concern. The
extreme (and constantly growing) number of network nodes, dynamically connecting to
and disconnecting from a network, renders existing routing techniques, such as
multicasting and broadcasting, unscalable, especially when using 128-bit IPv6
addresses. To address this limitation, this paper discusses the potential of
implementing IoT device discovery based on device properties, such as type,
functionality, location, etc., and presents an approach to enable property-based
access to IoT nodes using Bloom filters. The proposed approach is space- and
network-efficient, and provides an opportunity to perform device discovery at various
granularity levels.





Keywords: Bloom filter · Internet of things · Edge computing · Device discovery


1 Introduction 


In the IoT context, a user (or an application) using hundreds or thousands of devices
has to deal with considerably long addresses to uniquely identify and refer to network
devices. The exponential growth in the number of devices is expected to introduce new
challenges to traditional computer network protocols, such as (i) efficient access to
a huge number of devices; (ii) security and privacy; (iii) interoperability and
standardisation; and (iv) efficient energy consumption. Moreover, given the extreme
number of heterogeneous devices constituting the IoT ecosystem, timely and accurate
device discovery based on specific parameters, such as device type, sensing/actuating
capabilities, status and powering options, is also seen as a pressing and challenging
issue. In this light, a potential way to enable device discovery in the IoT, taking
into account the size and complexity of the underlying networks, could be to include
additional parameters in the routing procedure with the goal of limiting the search
space. More specifically, a potential solution would be to address IoT network devices
not only through their IP addresses, but also through a combination of device
properties, such as their type, location, sensing/actuating capabilities, available
resources and so on. From this perspective, an envisaged solution could implement some
kind of selective routing algorithm, which would facilitate time- and
network-efficient device discovery in the IoT context. For instance, this would allow
collecting information from specific sensing devices (e.g. within a single building)
and applying actuation commands (e.g. turning on heating in rooms with low
temperature) in a selective manner. Similarly, remote device management and
maintenance would also become feasible, as users would be able to diagnose errors and
patch the required devices with corresponding software updates (e.g. keeping a
surveillance system up to date with the most recent security updates).

Taking into consideration these desired features of a possible solution, this paper
presents an approach to facilitate property-based device search and discovery in
complex IoT networks using counting Bloom filters. As explained below, the proposed
approach benefits from a space-efficient way of storing information about devices and
their properties, as well as from fast calculation times when deciding whether a
matching device is present in the network. Moreover, with property-based search using
Bloom filters, it becomes possible to perform device discovery at various granularity
levels.








2 Background: Bloom Filters 


A Bloom filter, originally defined by Bloom in 1970 [1], is a space-efficient
probabilistic data structure representing a set $S$ of $m$ elements using an array of
$n$ bits $B = (B[1], \ldots, B[n])$ initialised to 0. The filter uses a set of $k$
independent hash functions $H = \{h_1, \ldots, h_k\}$ with range $\{1, \ldots, n\}$,
each uniformly mapping an element of $S$ to a random position in the array $B$. More
specifically, for each element $s \in S$, the bits $B[h_i(s)]$ are set to 1 for all
$i$ with $1 \le i \le k$. A bit can be set to 1 multiple times, either through
different hash functions for the same element $s$ or through different elements of
$S$. As a result, the answer to the query ‘Is $b \in S$?’ is true if all bits
$B[h_i(b)]$ are set to 1; otherwise (i.e. if at least one bit is 0), $b$ is not in
$S$. While the Bloom filter has many advantages, such as fast access time and a
relatively small size (a few bytes per element at most), it may return false positives
on membership checks. A false positive occurs when the hashes of an element not in the
Bloom filter overlap with a combination of hashes of elements that are in the Bloom
filter.
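For orientation, the standard textbook approximation of the false positive probability can be written in the notation used above ($m$ elements, $n$ bits, $k$ hash functions). It is not derived in this paper and assumes independent, uniformly distributed hash functions; $k_{\mathrm{opt}}$ denotes the number of hash functions that minimises $p$ for given $m$ and $n$.

```latex
p \approx \left(1 - e^{-km/n}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{n}{m}\,\ln 2
```

Under this approximation, for example, $n/m = 10$ bits per element with $k = 7$ gives a false positive probability of a little under 1%.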

Deleting elements from a Bloom filter cannot be done simply by changing ones back to
zeros, as a single bit may correspond to multiple elements. To enable deletion of
elements, the so-called counting Bloom filter uses an array of n counters instead of
bits. Each counter tracks the number of elements currently hashed to its location [3].
Deletions can be safely done by decrementing the relevant counters. A standard Bloom
filter can be derived from a counting Bloom filter by setting all non-zero counters
to 1.

Bloom filters were originally introduced to improve data management performance, and
quickly became popular in a variety of databases and storage systems [1]. They have
since been widely used in distributed systems [6] across a wide range of application
domains. In recent years, Bloom filters have attracted increased interest in the
networking and security domains [2]. More specifically, they are widely applied in
intrusion detection, virus and spam detection, access control [4], and IP
traceback [5].
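The counting Bloom filter described in this section can be sketched as follows. This is a minimal illustration rather than a reference implementation: the class and method names are hypothetical, and deriving the k positions from salted SHA-256 digests is simply one convenient choice of hash functions.

```python
import hashlib
from typing import List


class CountingBloomFilter:
    """Array of n counters updated through k salted hash functions."""

    def __init__(self, n: int, k: int) -> None:
        self.n = n
        self.k = k
        self.counters = [0] * n

    def _positions(self, item: str) -> List[int]:
        # One position per hash function; the salt i makes the k hashes differ.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.n)
        return positions

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.counters[p] += 1

    def remove(self, item: str) -> None:
        positions = self._positions(item)
        if any(self.counters[p] == 0 for p in positions):
            raise KeyError(item)  # the item was certainly never added
        for p in positions:
            self.counters[p] -= 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.counters[p] > 0 for p in self._positions(item))

    def to_bits(self) -> List[int]:
        """Derive a standard Bloom filter by mapping non-zero counters to 1."""
        return [1 if c else 0 for c in self.counters]


cbf = CountingBloomFilter(n=64, k=3)
cbf.add("camera")
cbf.add("sensor")
cbf.remove("camera")
print("sensor" in cbf)
```

Removing an element only decrements its own counters, so elements that share a position with it stay discoverable, which is exactly what a plain bit array cannot guarantee.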


3 Proposed Approach 


The proposed approach is based on a two-step procedure. First, all the interme- 
diate nodes in the network hierarchy are populated with information about edge 
nodes. Second, once the network is populated, an IoT device can be discovered 
using a corresponding query. 


3.1 Populating the Network 





An IoT device in a network hierarchy can be represented as a tuple D = (ID, Prop),
where ID is a unique identifier of this device, and Prop is a set of properties of
this device, such as type (e.g. CCTV camera, environmental sensor, smartphone, etc.),
sensing capabilities (e.g. temperature, pressure, noise, acceleration, etc.), power
supply (e.g. solar panel, battery, power cord, etc.), manufacturer, model, production
date, and so on. Using suitable hash functions, each set of device properties Prop is
converted into a corresponding Bloom filter array. Next, the network hierarchy is
‘populated’ with these newly created Bloom filters in a bottom-up manner: edge devices
provide their immediate subnet gateway with their Bloom filter representations, which
are combined into a single Bloom filter using the bitwise OR operation. This process
is repeated iteratively up to the very top of the IoT network hierarchy. As a result,
the top-level server’s Bloom filter eventually contains the Bloom filters of all
individual edge nodes in its network.
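The bottom-up population step can be illustrated with a short sketch. The filter length, the number of hash functions, the property strings and the function names are all assumptions made for this example, and each filter is stored as an integer bitmask so that the bitwise OR is a single operation.

```python
import hashlib
from typing import Iterable

N_BITS = 64    # hypothetical filter length
N_HASHES = 3   # hypothetical number of hash functions


def bloom_of(properties: Iterable[str]) -> int:
    """Encode a property set as a Bloom filter stored in an integer bitmask."""
    bits = 0
    for prop in properties:
        for i in range(N_HASHES):
            digest = hashlib.sha256(f"{i}:{prop}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % N_BITS)
    return bits


def merge(filters: Iterable[int]) -> int:
    """Combine child filters with a bitwise OR, as a gateway would."""
    combined = 0
    for f in filters:
        combined |= f
    return combined


# Two hypothetical subnets; each device is described by its property set.
subnet_a = [bloom_of({"type:smartphone", "power:battery"}),
            bloom_of({"type:smartphone", "power:cord"})]
subnet_b = [bloom_of({"type:camera", "power:cord"})]

gateway_a, gateway_b = merge(subnet_a), merge(subnet_b)
server = merge([gateway_a, gateway_b])
print(f"{server:0{N_BITS}b}")
```

Because OR never clears a bit, every bit set by an edge device is still set at its gateway and at the server, which is what allows a top-down search to prune only subnets that cannot contain a match.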











3.2 Device Discovery 


At the device discovery step, Bloom filters are used to represent the corresponding
discovery queries. More specifically, each query is represented by a tuple
Q = (ID, Prop), where ID is a unique identifier of this query, and Prop is a set of
device properties that are expected to be discovered within the given network. These
properties can be seen as search parameters, as typically used in traditional
searching. Each query is then represented by a corresponding Bloom filter, built with
the same hash functions. Device discovery can be seen as the reverse of populating the
network hierarchy, executed in a top-down manner starting from the very top of the
network topology. By performing the bitwise AND operation, the top-level server first
checks against its own Bloom filter whether there is a device matching the query
parameters in its managed network (the check succeeds when every bit set in the query
filter is also set in the server’s filter). If the evaluation is true, the query is
sent down to the lower-level network gateways, which similarly check whether a
suitable device is present in their subnets. This process iterates in each subnet
either (i) until the very bottom level is reached and a suitable device is discovered,
or (ii) until one of the intermediate network nodes replies that no matching device is
present in its subnet.





3.3 Sample Scenario 


The simplified use case scenario assumes each device (and its properties) is
represented by a 6-bit Bloom filter. From left to right, these bits indicate whether a
device is: a camera (1), an environmental sensor (2), a smartphone (3),
battery-powered (4), powered by a solar panel (5), or powered by a cord (6). It is
also assumed that there are three subnets in the network, each containing three
devices (Fig. 1). The three network gateways contain the combined Bloom filters of
their respective subnets, and the server contains the overall Bloom filter
representation of the network. The goal of this scenario is to discover a camera
powered by a cord. Accordingly, the query is represented by the Bloom filter
BF = (1,0,0,0,0,1).

At the first step, the server evaluates the query against its own Bloom filter and
decides that there is indeed a matching device somewhere down the network. Next, the
query is propagated down to the three subnets, whose gateways evaluate the query
against their own Bloom filters. As seen in Fig. 1, Subnet A contains only smartphones
and Subnet B contains only environmental sensors. For these two gateways the query
evaluation returns false and, as a result, the network call is not propagated down the
first two subnets. Gateway C, by evaluating the query, determines that there is a
matching device in its subnet and sends the query to all three of its nodes. Two of
these nodes are cameras, but only one of them, Device C3, is actually powered by a
cord. By evaluating the incoming query, Device C3 finds that it matches the query
parameters and replies with its ID and network location. The reply is then sent back
to the server through the intermediate hops.

[Figure: the server holds the Bloom filter (1,1,1,1,1,1); the gateways of Subnet A,
Subnet B and Subnet C hold the combined filters (0,0,1,1,1,1), (0,1,0,0,1,0) and
(1,1,0,0,1,1); each subnet contains three devices, the last of which are labelled
Device A3, Device B3 and Device C3.]

Fig. 1. Property-based network device discovery using Bloom filters.

The presented property-based search for IoT devices enables flexible, fine-grained
discovery of IoT nodes. The more properties are specified in the query, the more
precise the search is and the fewer matching devices are discovered. Conversely, for a
single specified property, the search space is expected to be wider, since more
devices might satisfy the search query parameter. For example, in the simplified
scenario above, searching for a solar panel-powered device will yield 6 results.
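The sample scenario can be replayed with a small sketch. The per-device bit patterns below are hypothetical: they were chosen only to be consistent with the description above (Subnet A holds smartphones, Subnet B holds environmental sensors, Device C3 is the only cord-powered camera, and six devices are solar-powered). The class and function names are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Bits = Tuple[int, ...]


@dataclass
class Node:
    name: str
    bits: Bits = ()
    children: List["Node"] = field(default_factory=list)


def populate(node: Node) -> Bits:
    """Bottom-up step: a parent's filter is the bitwise OR of its children."""
    if node.children:
        child_bits = [populate(child) for child in node.children]
        node.bits = tuple(int(any(column)) for column in zip(*child_bits))
    return node.bits


def matches(bits: Bits, query: Bits) -> bool:
    """Bitwise AND check: every bit set in the query must be set in the filter."""
    return all((b & q) == q for b, q in zip(bits, query))


def discover(node: Node, query: Bits) -> List[str]:
    """Top-down step: descend only into subtrees whose filter matches."""
    if not matches(node.bits, query):
        return []
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(discover(child, query))
    return found


def subnet(name: str, device_bits: List[Bits]) -> Node:
    devices = [Node(f"{name}{i}", bits) for i, bits in enumerate(device_bits, start=1)]
    return Node(f"Gateway {name}", children=devices)


# Bit order: camera, environmental sensor, smartphone, battery, solar panel, cord.
server = Node("Server", children=[
    subnet("A", [(0, 0, 1, 1, 0, 0), (0, 0, 1, 0, 1, 0), (0, 0, 1, 0, 0, 1)]),
    subnet("B", [(0, 1, 0, 0, 1, 0)] * 3),
    subnet("C", [(1, 0, 0, 0, 1, 0), (0, 1, 0, 0, 1, 0), (1, 0, 0, 0, 0, 1)]),
])
populate(server)

cord_cameras = discover(server, (1, 0, 0, 0, 0, 1))
solar_devices = discover(server, (0, 0, 0, 0, 1, 0))
print(cord_cameras, len(solar_devices))
```

In this run the camera query is rejected at Gateway A and Gateway B, so only Subnet C is searched, while the single-property solar query has to descend into all three subnets.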




















4 Discussing the Benefits 


This paper presented an approach for property-based network discovery of IoT 
nodes using Bloom filters. Potential benefits of the proposed solution can be 
summarised as follows: 


Flexible device discovery at different levels of granularity: as opposed to the
traditional access to edge nodes in IoT environments, where IP addresses need to be
known in advance, the proposed approach enables searching for devices based on their
properties, such as type, sensing capabilities, powering options, etc. This kind of
property-based device discovery can be performed at various levels of granularity,
i.e. coarse-grained (e.g. discover any kind of camera within a network) or
fine-grained (e.g. discover an outdoor CCTV camera with high resolution, powered by a
solar panel). This flexibility has the potential to contribute to the creation of a
wide range of IoT systems, where the network topology is not static, but rather
devices are constantly joining and leaving the network. Moreover, the property-based
search paves the way for interchangeable IoT architectures, in which individual
elements are described in terms of their features and functionalities. This way, one
element can be substituted by a similar one based on their shared properties, in a
seamless, transparent manner.











Network efficiency: a Bloom filter (as suggested by its name) serves to filter
incoming search queries to avoid redundant broadcast calls through the whole network.
If an intermediate node determines that there is no matching device within its subnet,
it does not allow the query to go down that specific subnet, thus (i) decreasing the
amount of time needed to discover a device, and (ii) minimising the amount of
redundant network traffic and reducing network latency. Moreover, the query evaluation
procedure, i.e. performing the bitwise AND operation on two bit arrays, is a
time-efficient operation with minimal impact on the overall device discovery process.
In the presence of hundreds and thousands of edge devices and intermediate network
nodes, both network and time efficiency are seen as considerable benefits when
discovering devices (even taking into consideration the false positive rate).

















Space efficiency: Bloom filters are space-efficient data structures, requiring a
minimal amount of memory. Even when using counting Bloom filters, the resulting arrays
typically do not exceed 4 MB of storage space, a highly relevant feature in the
context of IoT environments, where individual nodes are not necessarily equipped with
mass storage facilities.


High accuracy: despite potential false positive results, which may occur during the
device discovery process at intermediate network nodes, the overall accuracy is not
affected, since the final decision on whether a matching device is present in the
network is taken by the edge devices themselves. Only if an edge device’s Bloom filter
matches an incoming query is a corresponding acknowledgement sent back to the server.
Otherwise, it is assumed that no matching devices were identified.











Scalability and extensibility: thanks to its ability to store large amounts of hashed
values and its high calculation speed, a Bloom filter can be updated with new elements
with minimal effect on the overall performance. This is especially important in the
context of IoT networks, which already comprise millions of devices and keep growing
in size and complexity.





5 Conclusions 





The presented solution enables flexible property-based device discovery in the context
of complex IoT networks using Bloom filters. As opposed to IPv6 routing, which
requires 128 bits to encode an address, the proposed approach benefits from the
space-efficient way of representing and storing data using a Bloom filter. This also
contributes to decreased traffic and network latency, as the device discovery duration
depends on how narrowly focused a search query is (i.e. the fewer devices match the
query, the less network traffic is generated). The Bloom filter decides whether a
device belongs to a subnet branch or not, and can ‘cut off’ the entire branch before
actually checking it. As a result, this considerably reduces the amount of network
traffic, especially when compared to broadcast and multicast routing techniques.


Abstract. There has been much research on psychological text analysis, and it has been
shown that the words people use can reflect their emotional states. In this paper, we
introduce how to analyze the psychology of the characters in vernacular novels
automatically. First, we process the dialogs with word segmentation and analyze the
segmented text with SC-LIWC. Then, a vector reflecting the psychology of each
character is obtained and mapped to the Big Five. Finally, taking the dialogs of the
Journey to the West as the corpus, we obtain the personalities of the four main
characters, which are consistent with well-known commentaries on the Journey to the
West, showing that our approach is effective.


Keywords: LIWC · Vernacular · The Journey to the West · The big five · Text analysis


1 Introduction 


The use of words in a text can reflect the individual’s psychological state and
personality [1]. Linguistic Inquiry and Word Count (LIWC) is a tool for analyzing
text, and the way it works is fairly simple. Basically, it reads a given text and
counts the percentage of words that reflect different emotions, thinking styles,
social concerns, and even parts of speech.
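The counting idea behind LIWC can be illustrated with a toy sketch. The three-category lexicon below is invented for this example and is not the LIWC or SC-LIWC dictionary, and the function name is hypothetical; the real tools use much larger, validated category word lists.

```python
from collections import Counter
from typing import Dict, List, Set


def category_percentages(tokens: List[str],
                         lexicon: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return, per category, the percentage of tokens that belong to it."""
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in lexicon}
    counts: Counter = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: 100.0 * counts[category] / total for category in lexicon}


# Toy lexicon with made-up category names and word lists.
lexicon = {
    "posemo": {"happy", "glad"},
    "negemo": {"sad", "afraid"},
    "pronoun": {"i", "we", "you"},
}
tokens = "i am happy and we are not afraid".split()
profile = category_percentages(tokens, lexicon)
print(profile)
```

The resulting dictionary is the kind of per-category vector that can then be compared across speakers or mapped to personality dimensions.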

To date, LIWC has been applied in many psychological studies. It is often used to
examine suicide writings in order to characterize the quantitative linguistic features
of suicidal texts. In [2], the authors analyze texts compiled in Marilyn Monroe’s
Fragments using LIWC, in order to explore the relationship between the use of
different linguistic categories over the years and her suicide. The results coincide
with different theories of suicide. López-López et al. [3] analyzed questions and
answers on StackOverflow to explore the users’ personality traits. They found that the
top reputed authors are more extroverted than general users. Moreover, authors who got
more votes express significantly fewer negative emotions than those who got fewer
votes. Markovikj et al. [4] explored the feasibility of modeling user personality
based on features extracted from Facebook. In [5], the authors collected a sample of
363 participants, including their written self-introductions and final course
performance; the results show that course performance could indeed be predicted by the
word usage of linguistic categories.

LIWC for Traditional Chinese, TC-LIWC, was published with the authorization of
Pennebaker by Huang et al. (2012). Subsequently, SC-LIWC for Simplified Chinese [6, 7]
was published on the basis of TC-LIWC, laying the foundation for later
research [10, 12].

Recently, some researchers have turned to automatic personality prediction using LIWC.
Personality is stable over a period of time, so a corpus collected over several months
is suitable for this research. Gao [12] selected 1766 participants, first asked them
to fill in a Big Five inventory for comparison, and then collected their Weibo posts
through the Sina API. 90% of the samples were used for training with LIWC and the rest
served as the test set. In the training stage, the Pearson coefficient between the
inventory and the training results was computed, and the features that behaved well in
training were chosen. Finally, the selected features were combined to predict the
personality of the test set. The Pearson coefficient was computed as before, and the
results are between 0.3 and 0.4. Since the coefficient between self-rating and rating
by observers is about 0.5, the method has some predictive ability.

We seek a method to predict the personality of characters in novels written in the
vernacular. Vernacular is a written language with some artistic processing. It is easy
to read, but still has some features of ancient Chinese. Vernacular is generally used
for literature, especially novels. Vernacular novels have been very popular since the
beginning of the Ming Dynasty, and three of the four famous Chinese novels were
completed in the Ming Dynasty. After that, vernacular novels became more and more
popular. There are many excellent ancient books in China, which created numerous
fictional characters; A Dream of Red Mansions alone contains hundreds of characters.
Readers spend a lot of time reading the books and looking things up in the library to
understand these figures and step into the author’s inner world. If we can analyze the
characters of the books automatically, it will save much time and help readers follow
the books.

In this study, we use LIWC to analyze the personality of the characters in vernacular
novels. The process of personality analysis is shown in Fig. 1. The rest of this paper
is organized as follows. Section 2 discusses the preprocessing of the novel. In
Sect. 3, we segment the dialogs obtained in Sect. 2; the segmented dialogs are then
used in Sect. 4 to analyze the personality of the characters and obtain their Big Five
traits. Finally, we also present the personality change of Sun Wukong before and after
the three strikes of the White Bone Demon.

Fig. 1. The flow chart of automatic personality analysis of vernacular novels.


2 Data and Data Preprocessing 


The Journey to the West is the first ancient Chinese romantic novel. The book deeply
depicts the social reality of its time, and mainly describes the origins of Tang
priest, Sun Wukong, Zhu Bajie and Sha monk, together with the story of their
pilgrimage to the west. Over the centuries, the Journey to the West has been
translated into many languages, and a number of research monographs that rate the
novel highly have been published. The Journey to the West is known as one of the four
famous Chinese classics. There are four main characters in the Journey to the West:
Sun Wukong, Sha monk, Zhu Bajie and their master, Tang priest.

James W. Pennebaker showed that the words people use in their daily lives contain
important psychological information [8]. In particular, he showed that not only nouns
and verbs serve as markers of emotional state, social identity and cognitive style:
particles, which serve as the glue that holds nouns and regular verbs together, can do
the same. That means we can study the particles instead of using more complex methods.
On the basis of his work, we decided to use the dialogs of the characters to study
their personality. We select the dialogues and inner monologues of each character and
put them into separate files. There are a lot of descriptive verses in the text. We
delete them because they would interfere with the segmentation process, and because
these verses are often spoken by other people rather than by the characters
themselves, so we believe that the descriptive verses have little bearing on
personality. Note that the quotation marks at the end of sentences should not be
included, but other punctuation is retained for the next step. In the end we get four
files.


3 Segmentation 


In order to get the particles for analysis, we first perform text segmentation.
Current methods of Chinese word segmentation can be divided into three kinds: methods
based on lexicons, methods based on statistics and methods based on semantics. There
are also methods that mix two or three of them in order to improve accuracy. Many word
segmentation systems exist for modern Chinese, for example LTP-CLOUD, NLPIR and jieba,
and all of them achieve good results in Chinese word segmentation. Among these tools,
LTP and NLPIR are comprehensive toolkits, while jieba only provides segmentation.
Moreover, LTP and jieba are open source tools, but NLPIR is not.

Few studies address ancient Chinese segmentation. First, there are few corpora
for ancient Chinese. Segmentation needs a large manually annotated corpus to improve
accuracy, but no one has done this work for ancient Chinese. Second, there seems to be
no economic benefit in it. Hou et al. have studied ancient Chinese segmentation [9],
but their corpus is still very small, so the results cannot be generalized easily.

Vernacular Chinese has features of both ancient Chinese and modern Chinese. Therefore,
we can refer to the methods for modern Chinese segmentation.

Punctuation and function words have not changed much over the years, and they are
also used in ancient Chinese. Besides, many words from ancient Chinese are still in
use. In addition, Chinese word segmentation methods are able to identify new
words by statistical means.

To simplify the segmentation procedure, we run a simple test on the segmentation
of the vernacular text and find that most words are segmented correctly by applying
LTP directly. But there are still many mistakes: notional words are not distinguished
from others, idioms are segmented wrongly, and the algorithm generates other mistakes
as well. This is mainly because of the differences between vernacular and modern
Chinese. Many words are no longer used now, especially those that appear only
a few times in the text.

A simple and efficient way to solve the above problems is to add a manual
dictionary to LTP. The dictionary includes notional words and some idioms.
The notional words include place names, monster names, the names and nicknames of
Tang priest and his three apprentices, the names and nicknames of gods, particular
items, and some words about the emperor and the dynasty.


• Place names. Many places are made up by the author, such as the monsters'
caves, the gods' mountains, the monkey's birthplace, and so on. Adding these words to
the dictionary makes sure they are segmented correctly.

• Monsters' names. Tang priest and his three apprentices meet many big monsters on
the Journey to the West. Each of them has a nickname, and even some of the small
monsters under them have one, too. To split them correctly, we put them into the
dictionary.

• The four characters' names and nicknames. Though there are only four people, each
of them has many nicknames. Tang priest alone has more than 5, for example,
priest, Tang priest, Xuanzang, elder, Tang elder, master, and so on. They appear
especially often in the dialogs, so splitting them from other words is important.

• Gods' names and nicknames. Along the long way, the four meet
a lot of gods. Each of them has several nicknames, especially Guanyin, who has
almost as many as Tang priest.

• Particular items. There are many particular things in the Journey to the West,
such as kinds of weapons, treasures, and so on. Many of them are not commonly used
in modern Chinese.

• Some idioms. Idioms are fixed phrases in ancient Chinese.

• Some words about the emperor and the dynasty.
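
The effect of a manual dictionary can be illustrated with a minimal sketch. It does not use LTP; it assumes a plain forward maximum matching segmenter (a lexicon-based method), and the lexicon entries, the helper name `segment`, and the sample sentence are hypothetical choices made only for illustration.

```python
def segment(text, lexicon, max_len=6):
    """Forward maximum matching: at each position take the longest lexicon entry.

    Characters that start no known entry are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical base lexicon and hypothetical manual dictionary entries.
base_lexicon = {"师父", "我们"}
manual_dictionary = {"孙悟空", "花果山", "金箍棒"}

sentence = "孙悟空去花果山"
before = segment(sentence, base_lexicon)
after = segment(sentence, base_lexicon | manual_dictionary)

print(before)  # the names fall apart into single characters
print(after)   # the names are kept whole once they are in the dictionary
```

Without the manual entries the character and place names are split into single characters; with them, each name comes out as one token. A statistical segmenter behaves differently in detail, but the role of the added dictionary is the same.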


We randomly select 1000 characters from the segmentation results of each of
the four characters, and five of our colleagues check them independently. When all
are finished, they vote on the items they disagree on. Table 1 shows that the error
rate is about 2%~3%; in other words, the accuracy is about 97%~98%. Punctuation is
counted among the words here; even if it were excluded, the error rate would not
exceed 1.2 times the reported figures. As we can see, this method is quite effective.


Table 1. The error rate of word segmentation.


Character     Words   Errors   Error rate
Tang priest   ...     27       0.0318
Sun Wukong    ...     21       0.0274
Zhu Bajie     ...     25       0.0239
Sha monk      753     30       0.0398
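
The checking procedure can be sketched as follows. This is only an illustration under stated assumptions: each of the five reviewers is assumed to give a yes/no judgment per word, disagreements are settled by simple majority, and the error rate is the number of wrongly segmented words divided by the number of words checked. The function names and the sample votes are hypothetical; the counts 30 and 753 are taken from the Sha monk row of Table 1.

```python
from collections import Counter


def majority_vote(judgments):
    """Return the judgment given by most reviewers (True = segmented correctly)."""
    return Counter(judgments).most_common(1)[0][0]


def error_rate(errors, words):
    """Share of checked words that were segmented wrongly."""
    return errors / words


# Hypothetical votes of five reviewers on one disputed word.
votes = [True, True, False, True, False]
verdict = majority_vote(votes)

# Sha monk row of Table 1: 30 errors among 753 words.
rate = error_rate(30, 753)
print(verdict, round(rate, 4))
```

With an odd number of reviewers and a binary judgment, a majority always exists, so no tie-breaking rule is needed.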


In [12], five of the six feature classes are properties of Weibo, so in our work we
use only the LIWC features. Because of the differences between ancient Chinese and
modern Chinese, the LIWC dictionary has to change accordingly, and so do the features
selected on the basis of LIWC. To solve this, we remove the features that are not
consistent with ancient Chinese. The remaining features serve as input to the model
trained in [12]. To inspect personality more intuitively, we adopt a quantitative
method commonly used in psychology, the big five personality traits
[11]. In the big five model, a person's personality is abstracted into five
dimensions, which are shown in Tables 2 and 3. The big five scores of Tang priest and
his apprentices are shown in Table 4.
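
The feature step can be sketched roughly as below. This is not the LIWC dictionary or the model of [12]; it assumes that each feature is the relative frequency of one word category in a character's segmented dialogs, and that unsuitable categories are simply dropped. The category names, word lists, and the function `category_features` are hypothetical.

```python
from collections import Counter

# Hypothetical mini dictionary: category name -> words in that category.
CATEGORY_WORDS = {
    "particle": {"的", "了", "也"},
    "negation": {"不", "没"},
    "microblog_only": {"转发"},
}

# Hypothetical categories judged inconsistent with the older text.
EXCLUDED = {"microblog_only"}


def category_features(tokens, category_words, excluded):
    """Relative frequency of each retained category among the tokens."""
    kept = [name for name in category_words if name not in excluded]
    counts = Counter()
    for token in tokens:
        for name in kept:
            if token in category_words[name]:
                counts[name] += 1
    total = len(tokens)
    return {name: (counts[name] / total if total else 0.0) for name in kept}


tokens = ["我", "也", "不", "去", "了"]  # hypothetical segmented utterance
features = category_features(tokens, CATEGORY_WORDS, EXCLUDED)
print(features)
```

The resulting dictionary plays the role of the feature vector that is passed on to a trained model.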







Table 2. The big five 1. Each of them has 6 facets.


Agreeableness         Conscientiousness      Extraversion
Trust                 Competence             Warmth
Straightforwardness   Order                  Gregariousness
Altruism              Dutifulness            Assertiveness
Compliance            Achievement striving   Activity
Modesty               Self-discipline        Excitement seeking
Tender mindedness     Deliberation           Positive emotion


Table 3. The big five 2.


Openness to experience   Neuroticism
Fantasy                  Anxiety
Aesthetics               Hostility
Feelings                 Depression
Actions                  Self-consciousness
Ideas                    Impulsiveness
Values                   Vulnerability to stress


Table 4. The big five of Tang priest and his three apprentices.


              Open.   Neur.
Tang priest    3.50   25.42
Sun Wukong     0.92    1.93
Zhu Bajie     15.08   18.27
Sha monk       1.10   26.31





4 Personality Analysis of Characters 


4.1 Agreeableness 


As an eminent monk of the Tang dynasty, Tang priest is very kind and compassionate
[13]. On the Journey to the West, he tries to help others even when he himself is in
danger. He is modest and submits to authority, such as the emperor of the Tang dynasty
and all the gods they meet on the journey. But sometimes he is egoistic and often shifts
responsibility. His agreeableness is relatively high.

Sha monk has been very careful and submissive since he was subdued by Sun Wukong. He
is never egoistic and does his best to serve the master and help his brothers [14].
Tang priest and Sun Wukong both trust him deeply. His agreeableness is the
highest.

Sun Wukong is capable, but his master does not trust him. He has helped many
people, but that does not mean he is willing to sacrifice himself. If someone harms his
interests, he does not hesitate to teach him a lesson. Sun Wukong is also an arrogant
character who never understands what modesty is, and he is not gentle either. In
conclusion, his agreeableness is the lowest.


4.2 Conscientiousness 


Tang priest has no ability to protect himself and no experience in dealing with monsters.
He strictly abides by the doctrine and tries his best to protect it. He also has a strong
will, which makes him travel west resolutely to obtain the Mahayana Buddhist scriptures
[13]. All in all, his conscientiousness is relatively high.

Though exiled from heaven merely for knocking over a cup made of colored glaze,
Sha monk does not get angry but accepts his destiny and atones for his sin. In this
respect, he is somewhat like Tang priest [14]. Compared with his brothers he seems plain,
but he is better than them in human nature. So his conscientiousness is relatively
high.

Sun Wukong is the strongest among his brothers. He holds that the stronger
should hold power and never yields to authority [15]. Therefore, many gods serve him
like servants. He tries his best to protect the master, not for any benefit but to repay
Tang priest for saving him. He cannot restrain his aggressive instincts, and that makes
Tang priest most unhappy. It is not surprising that his conscientiousness is lower than
that of his brothers.

Zhu Bajie is similar to Sun Wukong in agreeableness and conscientiousness. There
are also differences: Zhu Bajie scores higher in tender mindedness, while Sun Wukong
scores higher in altruism [16, 17]. Zhu Bajie obeys the laws. Sun Wukong was king of
the monkeys before he followed Tang priest, so he has no knowledge of them. Thus, in
order, Zhu Bajie scores higher. But he always declares that he will go back to Gao
Village when he encounters danger. Sun Wukong loves battle while Zhu Bajie loves women.
In total, they are neck and neck in agreeableness and conscientiousness.


4.3 Extraversion 


In terms of extraversion, Zhu Bajie ranks first without question. He is very lustful
and shows great enthusiasm for women. He loves to eat, too [18]. Each time they
arrive in a new town, he is the happiest because he can eat a lot. Zhu Bajie is outgoing
compared with the others.

Sha monk is upright and honest [19]. He talks little, but what he says is very useful.
When there is conflict within the group, Sha monk is the one who tries to resolve it.

Tang priest is kind and friendly when dealing with people, but not enthusiastic. He
prefers quiet to noise. So his extraversion is low.

In Table 4, Sun Wukong is the lowest in extraversion. Although his warmth is not as
high as that of his brothers, his other facets should be higher than theirs. A possible
reason is that the dictionary we use does not fit his dialogs well.


4.4 Openness 


Openness is an indicator of the level of intelligence. From Table 4 we can observe that
Zhu Bajie gets the highest score. Zhu Bajie is often called a “fool”, but that is not
the case [18]. He is fond of eating and sleeping and good at flattering the master,
so Tang priest trusts him very much. In general, he always makes the best decision for
himself.

As the leader of the group, Tang priest is well learned and well behaved. But he is often
cheated by monsters and confused by Zhu Bajie, and several times he drives away Sun
Wukong, who tries to protect him.

Sha monk is in effect a servant of Tang priest. He is responsible for all the trifles but
never complains about them [20].

Sun Wukong is powerful and good at dealing with enemies; however, he is often
fooled by Zhu Bajie. So in fact, Zhu Bajie is the most intelligent member of the group.


4.5 Neuroticism 


In terms of neuroticism, Tang priest and Sha monk are similar to each other. When
facing danger, Tang priest is anxious and scared, and often bursts into tears [13].
Sha monk is puzzled, and ‘What should we do?’ is his pet phrase. Zhu Bajie helps Sun
Wukong a lot in fighting, but once they fail, the first thing he thinks of is escaping.
Sun Wukong never gives up; even when he is alone, he fights until he succeeds.


5 The Personality Change of Sun Wukong 


We also used this model to study the personality change of Sun Wukong before and after
the three strikes against the White Bone Demon. The results are shown in Table 5.


Table 5. The big five of Sun Wukong before and after beating the White Bone Demon
three times.


         Open.   Neur.
Before   19.27   6.62
After     1.03   1.19


Beating the White Bone Demon three times was a turning point for Sun Wukong.
The author explains Sun Wukong's change of mind through the words of Zhu Bajie.

Before that, Sun Wukong was very irritable, but under the constraint of Tang priest
he changed gradually and was no longer impulsive. This corresponds to the
decrease in neuroticism.

In the early stage, Sun Wukong despised authority, refused to obey discipline,
and showed a clear sense of rebellion. Later he was influenced by Tang priest and no
longer had a strong sense of resistance. As we can see in Table 5, his openness
changed greatly before and after the three strikes against the White Bone Demon.

Agreeableness is higher after the three strikes. In getting along with people,
Sun Wukong changed obviously, especially toward the Bodhisattva, the Buddha, and
other venerable figures. He became more and more courteous and was no longer as
arrogant as before.

In the early stage, Sun Wukong's sense of responsibility was not clear; he even tried
to kill Tang priest and attacked the Bodhisattva at first. Later, however, he accepted
his duty and became a qualified defender.





6 Conclusion 


In this paper, we make an automatic personality analysis of the characters in the Journey
to the West using LIWC, and map the feature vector onto the big five to make it easy to
observe. We compare the results with many well-known reviews, compare the results
between characters, and compare the results before and after a turning point. These
comparisons all show that automatic personality analysis of characters with LIWC is
feasible.


Abstract. More and more higher education institutions are adopting computer
based learning management systems to boost student learning. Networking
and collaboration through social media platforms are vital realities. Learning is
no longer limited to classrooms; it is now independent of location and time.
Understanding how students learn in this realm is a major challenge for teaching
professionals. Fortunately, data is abundantly available through learning
management systems and social media platforms. Analyzing this vast data can give an
insight into how learning happens these days. Data mining techniques are
widely used for this purpose. In this paper, we present a statistical analysis
of e-learning data obtained from SCHOLAT, a scholar oriented social networking
system. The analysis aims at getting data oriented perspectives on learning, e.g.,
which factor impacts learning and to what extent. The analysis revealed factors which
positively or negatively affect the learning achievement of the students, i.e., course
final scores.


Keywords: Data mining · Academic social network ·
Learning management system · Statistical analysis


1 Introduction 


Self-motivated, self-directed, self-paced and collaborative knowledge construction: this
is how learning can be described now and in the near future. Learning is not limited to
classrooms or to individuals. In fact, apart from a small fraction, most learning takes
place outside classroom settings and through collaboration [1]. Recognizing this need,
modern e-learning systems are built around the ‘knowledge construction by collaboration’
principle. These e-learning systems include features like wikis, forums, discussion
boards, podcasts, etc., to facilitate collaborative and interactive ‘free’ learning.

Students learn more by exchanging knowledge with each other and by participating in
meaningful interactions and learning discourses [2]. Those who are engaged in
collaborative and interactive learning activities are more productive in terms of learning
achievements [3]. Engaging students in meaningful and productive social interactions
is a major concern of modern course instructors, and vertical social networking
platforms emerged in response. These social networking systems are built to
meet the demand for social interaction of a specific group of users with similar needs,
for example learners and scholars. Together with appropriate e-learning technology, such
a system can fulfill the demands of future learning.

Students involved in on-line discussions and interactions can easily be
distracted [4]. This is not what the course instructor desires, and the learners
themselves may not be aware of it. The course instructor should decide which dimension
of the learning process is most appropriate to be supported by social interactions [5].
This is not likely to happen until instructors analyze the course very closely. For a
course supplemented by collaborative activities, the gap between the instructor's
intentions and the students' response should be well known. Making sure of this is what
guarantees a sound learning process.

E-learning and social networking systems gather vast data as a result of users'
interactions with the system; [6] termed it a gold mine of educational data. Data
mining techniques can be applied to this data to get an insight into the learning process.
Unlike traditional face-to-face learning settings, where data is scarce and available only
at the end of a session, the data gathered through e-learning and social networking
systems is abundant and available beforehand. Therefore, the course instructor can get
useful knowledge about the current learning process. The emergence of the educational
data mining and learning analytics fields is a response to the growing demand for
analyzing the vast data produced in e-learning and social networking systems.

SCHOLAT is a vertical social networking platform built specifically for learners and
scholars. Through its course module, SCHOLAT also facilitates building a blended learning
environment to promote ‘beyond classroom interactions’ between learners and course
instructors. In this paper, we present an analysis of data retrieved from the SCHOLAT
course module. We analyze this data to get an insight into the learning process and to
know how the students behave. The analysis will help us understand the various
dynamics of the learning process and, in future, will lay the foundation for building more
sophisticated algorithms to understand learning in a ‘blended social collaborative’
learning environment.

The paper is organized as follows: Sect. 2 gives a brief review of related work,
Sect. 3 introduces SCHOLAT, Sect. 4 presents the data and methods used in this
analysis together with the results, Sect. 5 discusses the results, and finally
Sects. 6 and 7 present future work and the conclusion, respectively.


2 Literature Review


Numerous studies have been conducted to demonstrate the effectiveness of collaborative
and interactive learning. In essence, these studies emphasize that future learning
will rely on cooperative and collaborative means. The presence and popularity of modern
social media networks underline this fact. Researchers recognized it quite a while ago,
but now it is indispensable. In this brief review, we present some of the important
studies.

Social constructivism is a learning paradigm advocating social and collaborative
learning. This paradigm emphasizes learner-to-learner interactions, knowledge
co-construction and sharing of contents [7, 8]. According to this theory, the learning
process resides in social interactions, and the foremost learning activities are group
work and collaboration.

Collaborative activities, in fact, supplement face-to-face teaching [9].
Computer based tools help to remove spatial and temporal limitations, thereby
expanding the scope of cooperation beyond classrooms to make anywhere and anytime
learning possible [10]. Yu et al. asserted that collaborative learning helps in achieving
desirable learning outcomes [11]. Another study maintained that more collaborative
students achieve more in terms of their learning outcomes [12].

It is the instructor who decides the scope and limits of social interactions and
collaboration. Learners are supported by engaging in beyond classroom activities and by
the availability of course related material outside classrooms [13]. Careful design and
conduct of learning activities ensure a sound learning process. The instructor can
analyze the vast data generated by students' use of a computer based learning system to
have a close look at the learning process.

Data mining, or knowledge discovery in databases (KDD), is a set of techniques applied
to extract implicit and interesting patterns from large collections of data [14]. The
most commonly observed data mining methods are statistics, visualization, clustering,
classification, association rule mining, sequential pattern mining, etc. [15].
Educational data mining (EDM), in turn, is the application of data mining techniques to
a specific dataset coming from an educational environment to address important
educational questions [21].

Data mining techniques have been used to improve the e-learning process [16]. There
are various uses of data mining of educational data, for example, to explore, analyze and
visualize data, to find useful patterns, to discover how students learn, etc. [17-19].
Other studies indicate further uses of data mining, for example, recommending
activities for students, getting feedback about the learning process, evaluating course
structure, classifying learners according to their learning needs, discovering regular
and irregular patterns, adapting computer based learning systems to user needs, etc. [15].

In a previous study [20], a descriptive statistical and correlation analysis of
SCHOLAT e-learning data was presented. The analysis disclosed many interesting facts
about students' behavior in an online learning management system. It revealed
higher descriptive statistics values for variables representing activities expected to
contribute towards final scores, and a moderately weak correlation between them. Lower
descriptive statistics values and weak correlation with final scores were observed for
the other variables. The distributions were skewed and outliers were present in
the data.

However, the previous analysis was carried out on a single class and only four variables
were used (3 independent and 1 dependent). In this paper, we present an extended and
diverse analysis of SCHOLAT e-learning data. We include four classes in this analysis to
study behavior across different student cohorts. Further, we use 11 variables (10
independent and 1 dependent). These variables represent administrative and collaborative
use of the learning management system. By using statistical and correlation analysis
techniques, the analysis reveals common trends of system usage among different groups of
students. We use course final scores as the criterion of successful learning. We plan to
use this analysis as groundwork for developing machine learning models representing
students' learning in a blended social learning environment.


3 SCHOLAT-Past and Future 


SCHOLAT is an emerging vertical social networking system designed and built
specifically for scholars, learners and course instructors. It uses undirected graphs to
represent the social network structure. It is bilingual, i.e., it supports English and
Chinese. The main goal of SCHOLAT is to enhance collaboration and social interactions
focused around scholarly and learning discourses among a community of scholars. In
addition to social networking capabilities, SCHOLAT incorporates various modules to
encourage collaborative and interactive discussions, for example, chat, email, events,
news posts, etc. Table 1 shows a comparison of SCHOLAT with other similar social
networking systems. SCHOLAT has an advantage over them in the range of functions it
offers.


Table 1. Comparison of SCHOLAT and other scholar networking systems


Function  SCHOLAT  SocialScholar  eScience  Scholarmate
Data space  Yes  Yes  Yes  Yes  Yes
Academic  Yes  Yes  Yes  Yes  Yes
Team  Yes  Yes
Course  No  No
Email  No  No
Chat  Yes  No  No


The course module is a distinctive feature of SCHOLAT. It is built
around the blended social learning concept, which gives its users, e.g., instructors and
students, an opportunity to engage in collaborative and interactive activities beyond
classroom settings. It mainly supports administrative and collaborative activities. For
example, instructors can create online courses and classes, present course introductions
and teaching plans on the course website, make announcements, assign tasks such as
homework to students, upload course help materials, interact with students via online
questions and answers, etc. Students can get up-to-date information about a course in
progress, download course help materials, submit homework, ask questions online, etc.
At the time of this study, 32758 students have been enrolled in 1785 classes of 760
courses.

SCHOLAT-course is still developing. Being a part of the SCHOLAT scholar social
networking system opens vast horizons for its future development. We plan to work on
the dimensions of social collaborative learning and to study the factors affecting
learning in a collaborative social learning environment. This study is the start of a
journey towards this destination.

In this study, we explore statistically supported answers to four research
questions: (1) what is the general pattern of system usage among the four classes,
(2) between course administration and collaboration, which type of course support is
used most by students, (3) is the data normally distributed or not, and (4) which
variable(s) is (are) most correlated with the course final scores.


Table 2. Variables in previous and current studies


Variables | Description

Student level data independent variables
Number of logins | The number of times a student has logged into the system
Number of online questions asked | The number of questions asked by a student online
Number of on-line reply | The number of replies to all questions posted by a student
Number of homework submitted on time | The number of homeworks submitted within deadline
Number of homework submitted late | The number of homeworks submitted after deadline
Number of homework not hand in | The number of due homeworks not submitted by a student

Course level data independent variables
Total number of homework | The number of homework assigned by instructor
Number of hits course notice | The number of times course notices were opened
Number of downloads course resources | The number of times course materials were downloaded by students
Number of comments | The number of comments posted. The comments can be posted by students or other users

Dependent variable
Course final score | The score obtained by student in a formative assessment written test

a Variables included in previous study [20].
b Variable for system use.
c Variable for collaboration.
d Variable for course administration.


Digging Deep Inside: An Extended Analysis of SCHOLAT E-Learning Data 415 


4 Data and Method 


In the previous study [20], the analysis was performed on one class. In the present
study, we include four classes. We also increase the number of variables from four to
eleven. The four classes were selected from three different courses: course ID165 (C
Language Programming), class ID279, taught in autumn 2014; course ID520 (Introduction to
Learning Sciences), class ID1149, taught in autumn 2015; and course ID206 (Software
Requirement), classes ID1563 and ID1729, taught in spring 2016. A total of 216 students
were enrolled in the four classes, of whom 188 passed the course or took the
final exam.

Table 2 shows the variables included in the previous and current studies. All of these
variables are quantitative. The variables are categorized into two types: student
level data, for which observations were available for each student, and course level
data, for which observations were available only at course level. Ten of the eleven
variables are independent variables; one variable (course final score) is the dependent
variable. These variables were extracted from course records stored in the SCHOLAT
database. The course final scores were not present in the database, so the relevant
course instructors were asked to provide them. The main reason for selecting these four
classes was the availability of course final scores.

Earlier, we used two types of descriptive statistical techniques: univariate and
bivariate [20]. The univariate technique was used to uncover the properties of individual
variables, whereas bivariate analysis was used to discover relationships between the
independent and dependent variables. We use the same statistical techniques in this
study. This study has the added advantage that results can be compared among four
different classes.

The results of the analysis are presented in the following three subsections.


4.1 Univariate Analysis 


We perform univariate analysis on the student level variables and the dependent variable
(course final score) discussed in the previous section. Therefore, seven variables in
total are included in this part of the analysis. The results are shown in Table 3. Five
descriptive statistics have been calculated for each variable in each class:
mean (average), mode, median, standard deviation and skewness.
Table 3 is divided into five compartments, each presenting
results for one descriptive statistic. The top row indicates the variables
included in the analysis. The leftmost column specifies the class to which a value
corresponds. The top row is not repeated; however, the leftmost column is repeated in
each subsection of the table. Each column next to the leftmost column corresponds
to the results for one variable; for example, the column under number of logins presents
results for this variable only. So if the reader wants to know the mode of
number of homework submitted on time for class ID279, first choose the column for this
variable and then go to the subsection showing the mode of each variable. In
this subsection, choose the row indicating the desired class; the intersection of that
column and this row is the required answer, i.e., 105.

Table 3. Descriptive statistics univariate mode


Columns: Class | No. of logins | No. of question | No. of reply | No. of homework on time | No. of homework late | No. of homework not hand-in | Final score

Mean
ID279    ...  2.96  ...  86.94
ID1149   ...  0.26  ...  73.77
ID1563   664.53  0  0  5.32  3.32  1.17  82.20
ID1729   68.43  0  0  6.86  2.43  1.08  84.27

[Mode and median compartments: values not recoverable from the source, apart from a final-score entry of 86 for ID1563]

Standard Deviation
ID279    ...  2.98  ...  11.57
ID1149   ...  0.44  ...  6.50
ID1563   2695.07  0.00  0.00  3.78  3.13  3.15  10.29
ID1729   168.10  0.00  0.00  3.83  2.69  2.85  8.90

Skewness
ID279    ...  -5.60
ID1149   ...  1.16  ...  -0.91
ID1563   5.92  0  0  ...  0.97  3.12  -0.76
ID1729   6.00  0  0  -0.21  1.14  3.13  -0.95
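
The five measures reported in Table 3 can be computed as in the following minimal sketch. The login counts are hypothetical, the function name `describe` is illustrative, and the skewness formula is an assumption: the text does not say which estimator was used, so the adjusted Fisher-Pearson coefficient common in spreadsheet tools is shown.

```python
import statistics


def describe(values):
    """Mean, mode, median, sample standard deviation and skewness of one variable."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    # Adjusted Fisher-Pearson skewness; an assumed choice of estimator.
    skewness = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / sd) ** 3 for x in values)
    return {
        "mean": mean,
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "sd": sd,
        "skewness": skewness,
    }


# Hypothetical number of logins for six students of one class.
logins = [0, 0, 3, 5, 8, 120]
summary = describe(logins)
print(summary)
```

A single heavy user pulls the mean far above the median and produces a positive skewness, which is the pattern the discussion of system usage relies on.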


4.2 Bivariate Analysis 


Table 4 shows the results of the bivariate analysis performed on each of the
aforementioned variables. We calculated the correlation coefficient between each of the
six independent student level variables and the dependent variable. The structure of
Table 4 is similar to that of Table 3 described in the previous subsection, with the top
row giving the variables and the leftmost column the relevant class. The rightmost
column, under final score, contains all ones since it indicates self-correlation.

Table 4. Descriptive statistics bivariate mode


Columns: Class | No. of logins | No. of question | No. of reply | No. of homework on time | No. of homework late | No. of homework not hand-in | Final score

Correlation Co-efficient (r)
[Table values not recoverable from the source]

* Significant at p-value < 0.05, 95% confidence interval.
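
The correlation coefficient r between one independent variable and the final score can be sketched as below, assuming the Pearson product-moment coefficient is meant (the text does not name the coefficient). The function name `pearson_r` and both data lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values for five students.
homework_on_time = [2, 5, 6, 8, 9]
final_score = [60, 72, 75, 88, 90]

r = pearson_r(homework_on_time, final_score)
self_r = pearson_r(final_score, final_score)
print(round(r, 3), self_r)
```

Correlating the final score with itself gives 1, which is why the final score column of such a table contains only ones. The sketch does not handle a constant variable, for which the coefficient is undefined.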





4.3 Course Level Variables 


Table 5 shows the figures obtained for the course level variables. We did not perform any
calculations on these variables as we did on the student level variables, since these
figures are available only at course level. However, we discuss their implications
in the next section. Classes ID1563 and ID1729 belong to the same course, so in Table 5
the figures for these two classes are presented jointly.


Table 5. Course level variables


Variable  ID279  ID1149  ID1563/ID1729
Total number of homework  ...  13
Number of hits course notice  6041  54
Number of download course resources  5969  9375  2176
Number of comments  0


5 Discussion 


In this section, we discuss the results presented in the previous section. The
discussion is focused on finding data supported answers to the research questions
presented in Sect. 3.

Table 3 presents the results of the univariate analysis. Five types of statistical
measures have been presented. We discuss each of them in the following text.

The number of logins shows the extent of system use by the students. The mean, mode and median values given in Table 3 show that the four classes differ in their usage behavior. Classes ID279 and ID1563 have high average values, i.e., 213.34 and 664.53, whereas classes ID1149 and ID1729 show low average usage, i.e., 22.32 and 68.43. The mode is the value that appears most often in the data [22]: classes ID1149 and ID1729 have modes of 16 and 39, class ID1563 has mode 3, and class ID279 has mode 0. The median is the value at the center of the ordered scores. If the distribution is not normal, the median gives a good idea of where the center of the data lies. The median of class ID279 is 0 (range: 0-14841), which indicates highly unbalanced use of the system, i.e., some students do not use the system at all while the usage of others is very high. A similar trend is observed in class ID1563, with a median of 47 (range: 0-17053). This is consistent with the high average values observed above for these two classes. The other two classes show relatively balanced usage, i.e., the median is 19 (range: 5-46) for class ID1149 and 41 (range: 11-1059) for class ID1729. From the above discussion it can be concluded that classes ID279 and ID1563 show high and unbalanced usage, while classes ID1149 and ID1729 show low but balanced usage. We also note the presence of outliers in classes ID279 and ID1563. The high values of standard deviation also signify varied usage behavior. Classes ID279, ID1563 and ID1729 show high variability in system usage, i.e., 1667.57, 2695.07 and 168.10. Class ID1149 has a relatively consistent usage pattern and a low standard deviation, i.e., 10.59.

The variables number of question and number of reply indicate the level of collaboration, since asking questions and getting replies promotes collaboration and interaction among students. The descriptive statistics in Table 3 indicate that these activities are largely neglected by students. The mean, mode, median and standard deviation values of these two variables for classes ID1563 and ID1729 are zero. For the other two classes these values are very low. The data indicate very low collaboration among the students.

The SCHOLAT course module has a comprehensive facility for instructors to upload homework and for students to submit it. Submitting homework on time indicates good behavior, whereas late submission or not submitting at all indicates procrastination, which is alarming for students' learning. The three variables number of homework on-time, number of homework late and number of homework not hand-in relate to this key aspect of course administration. The statistics in Table 3 give a rather satisfactory view, as the mean, mode and median values for all classes indicate a healthy trend of submitting assignments on time and avoiding procrastination. For class ID279 the mean value is 85.4% of the maximum value of 105. For the other classes, ID1149, ID1563 and ID1729, the mean values are 28.6%, 41% and 52.3%, lower than for class ID279 but still satisfactory. The standard deviation values for all four classes are not high, indicating consistent behavior in submitting homework on time. The mean, mode and median values for the variables number of homework late and number of homework not hand-in indicate that the students tend not to procrastinate, because it can lower their final scores. In Table 3, mostly zero and low values of mean, mode and median are observed for these variables. Further, the low standard deviation values indicate that avoiding procrastination is a universal trend among all four classes.

The final score variable indicates successful learning. The scores have a maximum value of 100. The mean, mode and median values in Table 3 indicate that, despite all odds, students manage to get good scores. The low standard deviation points to a healthy trend of good scores across all classes.

However, the distribution of the data is not normal, and we mostly see skewed distributions. Skewness is a measure of how asymmetric a frequency distribution is [23]. A normal distribution has a skewness of zero, whereas positive or negative non-zero values indicate a positively or negatively skewed distribution. Table 3 indicates that most distributions are positively or negatively skewed. There are also outliers present in the data, which make the distributions further skewed. The skewed distributions point to extreme behavior by students, in contrast to a normal distribution, in which average behavior is mostly seen.
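
The five univariate measures discussed above can be sketched in a few lines of Python. This is only an illustration: the login counts are hypothetical, the function name `describe` is ours, and the use of the population standard deviation and the moment-based skewness formula is an assumption, since the text does not state which estimators were used.

```python
import statistics

def describe(values):
    """Return mean, median, mode, population standard deviation and skewness."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # Moment-based skewness: the average cubed z-score (0 for a symmetric sample).
    skew = sum(((x - mean) / sd) ** 3 for x in values) / n if sd > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": sd,
        "skewness": skew,
    }

# Hypothetical login counts: most students rarely log in, one logs in heavily.
logins = [0, 0, 0, 2, 3, 5, 8, 400]
summary = describe(logins)
print(summary)
```

In such a sample a single heavy user pulls the mean far above the median and produces a positive skewness, which is the pattern described for classes ID279 and ID1563.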

Table 4 shows the results of the bivariate correlation coefficient analysis. The analysis was done to find the strength of the relation between each of the independent variables in each course and the final scores. The value of the correlation coefficient r expresses the strength of a positive or negative relationship, with the two extreme values +1 and -1 (strong positive or negative relation) and a value of 0 for no relation at all [24]. The variables have weak to moderate relationships, but very few significant correlations (p-value < 0.05) are observed. Course ID279 has only one significant correlation, i.e., number of homework not hand-in is negatively correlated with final scores (r = -0.35). In course ID1149, the number of logins has a significant positive correlation with final scores (r = 0.44) and the number of homework late has a significant negative correlation (r = -0.39). In course ID1563, the number of homework on time has a moderate, significant positive correlation with final scores (r = 0.41) and the number of homework not hand-in has a moderate, significant negative correlation (r = -0.50). Finally, in course ID1729 the number of homework on-time has a moderate, significant positive correlation with final scores (r = 0.42) and the number of homework late has a moderate, significant negative correlation (r = -0.50).
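
As a minimal sketch of how such a coefficient is obtained, the following Python function computes Pearson's r for one independent variable against final scores. The data are hypothetical and the function name `pearson_r` is ours; the p-value is not computed here (in practice a routine such as `scipy.stats.pearsonr` returns both r and the p-value).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical class: homework not handed in vs. final score.
not_handed_in = [0, 1, 2, 3, 4]
final_score = [90, 85, 80, 70, 60]
r = pearson_r(not_handed_in, final_score)
print(round(r, 3))
```

A value near -1, as in this toy sample, corresponds to the strong negative end of the scale described above; the coefficients in Table 4 are considerably weaker.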

Next we examine the course-level data. These are data that are either the same for the entire group or not available for individual students. For example, the total number of homework is the same for all students, i.e., 105, 8 and 13 for classes ID279, ID1149 and ID1563/ID1729 respectively. For the other three variables, number of hits course notice, number of downloads of course resources and number of comments, the figures were not available for individual students. However, these figures help to show the extent to which on-line services are used beyond the classroom. Since the figures are not available for each student, we cannot correlate them with students' achievement.

First, we look at the variable number of hits course notice. The figures of this variable for the classes under study, i.e., ID279, ID1149 and ID1563/ID1729, are 6041, 553 and 54, where the number of notices issued by course instructors was 30, 18 and 1 respectively. As such, we see more activity in class ID279 in reading notices and staying in touch with the course instructor. Similarly, we can view the number of downloads of course resources as another indicator of on-line interactive activity. Table 5 indicates that students downloaded course resources 5969 times in class ID279, 9375 times in ID1149, and 2176 times in ID1563/ID1729, whereas 61 resources were uploaded in ID279, 63 in ID1149 and only 18 in classes ID1563/ID1729. We see a rather healthy trend: the students benefit from this facility effectively, as indicated by the number of downloads in each class.

We observe disappointing activity in terms of comments posted by students. For our classes, there are only 30 comments for ID279, 3 comments for ID1149 and no comments for the other two classes combined. Such comments can be useful for course instructors and for other students willing to join the course.


6 Conclusion


The important findings of the above analysis can be summarized as follows:


There is a varied trend of system usage, i.e., in some classes students use the system heavily and in others they do not.

Students do not take much interest in on-line collaboration activities.

Students try their best to submit their homework and avoid procrastination.

In each class there are different factors that are significantly related to students' learning success (course final scores). We see both positive and negative correlations.

The distributions are skewed, which shows above- or below-normal behavior.


7 Future Work 


In the future, we intend to extend this work by building a machine learning algorithm to predict students' current learning and future achievement. This would enable course instructors to take timely action to prevent students' failures. We also intend to increase the data collection ability of the system.


Abstract. We extract a data set covering 2010 to 2014 from the China Computer Federation (CCF, http://www.ccf.org.cn) with a distributed web crawler system and build the co-author network with the R language. In this co-author network, nodes represent authors, and a pair of authors is connected by an edge if they have co-authored at least one paper over the entire period. We analyze this network with social-centric and ego-centric methods to study the state of the CCF co-author network and visualize the analysis results. The social-centric analysis reveals the co-author network density, the usage of key words, and the average number of authors per article, and confirms that most authors in the computer science field publish a very small number of papers but have more collaborators than authors in other fields. The ego-centric analysis computes betweenness centrality, closeness centrality, and degree centrality, and indicates that only a small percentage of the authors are located in the center of the co-author network. Finally, we pick out eight key persons and identify the group teams to which they belong. Based on these findings, we suggest that the computer science field should promote wider collaboration and encourage more authors to publish their papers.


Keywords: Co-author network · Social-centric · Ego-centric · Centrality · Key person · Visualize


1 Introduction and Motivation 


CCF aims to bring together scholars in computer research. CCF was founded in 1962 and is a member of the China Association of Science and Technology. It covers 13 journals, namely Chinese Journal of Computers, Journal of Computer-Aided Design & Computer Graphics, Journal of Computer Science & Technology, Journal of Computer Research and Development, Journal of Software, Journal of Computer Applications, Computer Engineering and Applications, Computer Technology and Development, Computer Science, Journal of Chinese Computer Systems, Journal of Frontiers of Computer Science and Technology, Computer Engineering and Science, and Computer Engineering and Design.

Social network analysis (SNA) methods have been used to study co-author networks in various fields including physics, library and information service (LIS), biology, and computer science. A study of the LIS co-author network using betweenness centrality, degree centrality, and average degree concluded that scholars in the LIS field need to strengthen communication with each other [1]. Zhu et al. [2] described the present situation and pattern of Chinese-foreign cooperation, and offered suggestions for international collaboration, by studying the co-author network of the information systems field. Sun et al. [3] identified the most influential authors and papers in a co-author network using network density and betweenness centrality. Shen et al. [4] applied the vector space model to the identification of scientific research teams within a co-author network, and revealed the co-author relationships by the degree of similarity between author vectors. Du [5] extracted the co-author network over the past 26 years of UIST (User Interface Software and Technology), which is generally recognized as one of the top conferences in the Human-Computer Interaction field, and analyzed it with SNA methods. El Kharboutly and Gokhale [6] revealed the collaboration patterns of co-authors from the co-author network extracted over the entire history of the SEKE conference.

Social network analysis, as the reviewed papers show, allows a deep understanding of the collaboration and communication among authors, and thus makes it possible to offer suggestions on how to improve and strengthen them. However, no one has investigated the CCF co-author network before. Therefore, it is necessary to study the CCF co-author network. The main contributions of this paper are summarized as follows:


1. We design and implement a distributed crawler system, collect the data set from CCF with it, and build the co-author network of CCF.

2. We explore the co-author network with social-centric and ego-centric methods. The social-centric analysis examines the major metrics of the data set and compares them with those of other fields. The ego-centric analysis reveals the betweenness centrality, closeness centrality, and degree centrality of the co-author network.

3. Based on the analysis of key words, we point out the research hotspots and research trends, identify the key persons with the highest degree centrality, and extract from the co-author network the group teams to which the key persons belong.


The rest of the paper is organized as follows: Sect. 2 describes data collection and pre-processing. Sections 3 and 4 discuss the social-centric and ego-centric analyses respectively. Section 5 identifies the key persons and their group teams. Section 6 concludes the paper and offers directions for future work.


2 Data Collection and Pre-processing 


We extract the data set for the most recent 5 years of CCF with the distributed crawler system. Table 1 shows the major metrics of the CCF co-author network. The distributed crawler system consists of a master node and several slave nodes. The master node is the core of the crawler system, and it is responsible for scheduling tasks, managing processes


424 C. Fu et al. 


and the crawling log, while every slave node performs the tasks. Furthermore, each slave node is a plug-in based on the Google Chrome browser, so it can run on any computer that has Chrome and runs independently of other programs [7]. The architecture of the distributed crawler system is presented in Fig. 1.


Table 1. Major metrics of CCF co-author network


Metric | Value
Total number of authors | 49453
Total number of key words | 70631
Average papers per author | 1.35
Average authors per paper | 4.38
Average collaborators per author | 4.88
Average number of key words per author | 9.09
Average clustering coefficient | 0.0037





[Figure: CCF paper websites, three slave nodes, and a master node]


Fig. 1. Architecture of the distributed crawler system.


Since the data obtained from the web by the crawler system are in JSON (JavaScript Object Notation) format, we parsed them and converted them into XML format programmatically. Furthermore, in pre-processing we found that some authors (only 10-20 instances) share their first name and last name. We disambiguated such authors by their emails, assuming that authors who share a name but not an email address represent different individuals.

After pre-processing, we build the CCF co-author network, in which nodes represent authors and a pair of authors is connected by an edge if they have co-authored at least one paper. In addition, the CCF co-author network is analyzed as an unweighted graph. That is, even if authors A and B have co-authored multiple papers, there is only one edge connecting author A with author B. Figure 2 shows an overview of the co-author network graph in CCF.
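
The construction of the unweighted network can be sketched as follows. The paper states that the network was built with the R language; the Python below is only an illustrative equivalent, and the function name `build_coauthor_graph` and the toy author IDs are hypothetical.

```python
from itertools import combinations

def build_coauthor_graph(papers):
    """Build an unweighted co-author graph.

    papers: iterable of author-ID lists, one list per paper.
    Returns a dict mapping each author to the set of his or her co-authors.
    """
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph.setdefault(author, set())
        # Sets make repeated collaborations collapse into a single edge.
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Toy data: A and B co-author two papers but are linked by one edge only.
papers = [["A", "B"], ["A", "B", "C"], ["D"]]
graph = build_coauthor_graph(papers)
num_edges = sum(len(neighbours) for neighbours in graph.values()) // 2
print(num_edges)
```

Author D, who has no co-authors, remains in the graph as an isolated node, so the node count still equals the number of distinct authors.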





Fig. 2. The overview of the co-author network graph in CCF 


3 Social-Centric Analysis 


In this section, we discuss the social-centric metrics that we computed over all nodes in the CCF co-author network. We compare these metrics, shown in Table 1, with those of other fields.


3.1 Major Metrics of CCF Co-author Network 


From Table 1, we know that the total number of authors is 49453 and the total number of key words is 70631. Furthermore, on average a CCF author writes 1.35 articles and collaborates with 4.88 authors. Compared with other communities within and beyond computer science, the average of 1.35 articles per CCF author is lower than that in LIS (2.71), biology (6.4) and physics (5.1), whereas the average of 4.88 collaborators per CCF author is higher than that in LIS (1.57), biology (3.75) and physics (2.53) [2, 8, 9]. These differences may arise from the nature of LIS, biology and physics. Additionally, on average a CCF article has 4.38 authors, and a CCF author has 9.09 key words.

Figure 3, which shows the distribution of the number of articles per author for authors with fewer than 10 articles, reveals that a very large percentage of CCF authors have fewer than 3 articles. Authors with 40 or more articles are very rare, with 87 being the maximum. To analyze the upper end of the distribution, we extract from the entire data set the authors who have 10 or more articles. Figure 4, which shows the distribution of the number of articles per author for these authors, further confirms that authors with 40 or more articles are rare.

Fig. 3. Distribution of number of articles per author with less than 10 articles.

Fig. 4. Distribution of number of articles per author with 10 or more articles.


3.2 Research Major Trend 


A research hotspot is a trend in a particular field. In the CCF co-author data set, there are 70631 key words. To identify the research trends of computer science, all key words that have been used more than 1000 times are referred to as Main Key Words (MKWs). Figure 5, which shows the MKWs of the CCF data set, indicates that seven MKWs account for 31.02%, a larger proportion than any other key words. The seven MKWs are data mining, support vector machine (SVM), clustering, wireless sensor network, genetic algorithm (GA), segmentation, and cloud computing. They are consistent with the major research trends of big data, Internet of Things (IoT), image processing, community discovery and cloud computing.
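
The selection of MKWs amounts to counting key-word usage and applying a threshold. The sketch below is an assumption-laden illustration: the function name `main_key_words` is ours, key words are normalized by lower-casing (the paper does not say whether this was done), the share is computed over all key-word usages (one possible reading of the 31.02% figure), and the toy threshold of 1 stands in for the paper's threshold of 1000.

```python
from collections import Counter

def main_key_words(keyword_lists, min_uses):
    """Return key words used more than min_uses times and their share of all usages."""
    counts = Counter(
        keyword.strip().lower()
        for keywords in keyword_lists
        for keyword in keywords
    )
    total = sum(counts.values())
    main = {kw: c for kw, c in counts.items() if c > min_uses}
    share = sum(main.values()) / total if total else 0.0
    return main, share

# Toy data: one key-word list per paper.
paper_keywords = [
    ["data mining", "clustering"],
    ["Data Mining", "SVM"],
    ["clustering", "data mining"],
]
main, share = main_key_words(paper_keywords, min_uses=1)
print(main, round(share, 3))
```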


[Figure: chart with segments Clustering, Wireless Sensor Network, Data mining, Genetic Algorithm, Support Vector Machine, Segmentation, and Cloud Computing]


Fig. 5. Proportion of MKWs in CCF data set


3.3 Co-author Network Density 


The network density (D) is defined as the ratio of the number of edges E to the number of possible edges and is given by Eq. (1), where N is the number of nodes [10]. The density of the CCF co-author network is 0.0037, indicating an overall very sparsely connected network.


$$D = \frac{2E}{N(N-1)} \qquad (1)$$
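
As a small worked instance of Eq. (1), the following Python function computes the density of an undirected graph from its node and edge counts. The function name `network_density` and the toy graph are illustrative only.

```python
def network_density(num_nodes, num_edges):
    """Density of an undirected simple graph: D = 2E / (N(N - 1))."""
    if num_nodes < 2:
        return 0.0  # No pair of nodes exists, so no edge is possible.
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

# Toy graph: 4 authors and 3 co-author edges out of 6 possible pairs.
density = network_density(4, 3)
print(density)
```

A complete graph has density 1, while a network in which most possible author pairs never collaborate has a density close to 0.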


4 Ego-Centric Analysis 


In this section, we discuss ego-centric centrality metrics, which are tools for understanding the structure, function, and composition of the co-author network ties around an individual.


4.1 Betweenness Centrality 


Betweenness centrality is used to answer the question of who controls knowledge flows. It is defined as the number of shortest paths between all pairs of authors that pass through the given author [11]. It is an indicator of an author's centrality in a network, or of an author's ability to control the flow of knowledge, resources and information [12]. The betweenness centrality of author v is given by Eq. (2), where P_{ij} is the total number of geodesics linking author i and author j, and P_{ij}(v) is the number of those geodesics that pass through author v. Authors with high betweenness centrality can play the role of a "bridge" or "middleman" in the co-author network, and can obtain resources, information and knowledge efficiently from other authors in the co-author network [13]. Table 2 shows the top 10 authors by betweenness centrality in the CCF co-author network.


$$C_B(v) = \sum_{i \neq j \neq v} \frac{P_{ij}(v)}{P_{ij}} \qquad (2)$$
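
Betweenness centrality as defined above can be computed for a small unweighted network with Brandes' algorithm. The sketch below is not the authors' implementation; it is a standard-library illustration in which the function name `betweenness_centrality` is ours, the graph is a dict of neighbour sets, and each unordered pair (i, j) is counted once (an assumption about how the sum in Eq. (2) is read).

```python
from collections import deque

def betweenness_centrality(graph):
    """Brandes' algorithm for an undirected, unweighted graph.

    graph: dict mapping each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its end points.
    return {v: c / 2.0 for v, c in score.items()}

# Toy network: B is the only "bridge" between A and C.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(betweenness_centrality(path))
```

In the toy path network only the pair (A, C) has a geodesic through another author, so B scores 1 and the two end points score 0.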





Table 2. Top 10 authors with betweenness centrality in CCF co-author network. 


Author alias | Betweenness centrality | Author alias | Betweenness centrality


28452 21029.19 | 19 25214 13763.72 


9161 4921 14594.92 
6151 22731.28 
5066 30312 13731.08 
28719 13689.25 


4.2 Closeness Centrality 


Closeness centrality is used to answer the question of who has the shortest distance to other authors. Closeness centrality is defined in terms of the mean length of all shortest paths from a node to the other nodes in the network [14]. It is measured as the average of the reciprocal distance of an author from the others. The closeness centrality of an author v is given by Eq. (3), where d(v, j) is the distance between authors v and j, and N is the total number of authors that author v can reach in the co-author network. Closeness centrality indicates how important an author is: the higher the closeness centrality, the more important the author. Moreover, an author with higher closeness centrality can access or obtain resources in the co-author network more efficiently than others with lower closeness centrality [15]. Additionally, an author with higher closeness centrality can communicate more efficiently than those with lower closeness centrality [12]. Table 3 shows the top 10 authors by closeness centrality in the CCF co-author network.





$$C_C(v) = \sum_{j=1}^{N} \frac{1}{d(v, j)} \qquad (3)$$
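
A closeness score based on reciprocal distances can be sketched with a breadth-first search. This is an illustration under an explicit assumption: Eq. (3) is read as the sum of reciprocal shortest-path distances from v to the N authors it can reach, and the exact normalization used to produce the values in Table 3 may differ. The function name `reciprocal_closeness` and the toy network are hypothetical.

```python
from collections import deque

def reciprocal_closeness(graph, v):
    """Sum of 1 / d(v, j) over all authors j reachable from v.

    graph: dict mapping each node to the set of its neighbours.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    # Unreachable authors never enter dist and therefore contribute nothing.
    return sum(1.0 / d for node, d in dist.items() if node != v)

# Toy network: B is adjacent to both A and C; D is isolated.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
print(reciprocal_closeness(network, "B"), reciprocal_closeness(network, "A"))
```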

Table 3. Top 10 authors with closeness centrality in CCF co-author network.


Author alias Closeness centrality
44551 43639 4.769e-10
45050 36614 4.761e-10
31895 45238 4.752e-10
36626 18639 4.740e-10
31865 18951 4.722e-10


4.3 Degree Centrality 


Degree centrality is used to answer the question of who knows the most authors. It measures the ties of an author with others. Degree centrality is defined as the number of links incident upon a node [16]. The degree centrality of author v is given by Eq. (4). Authors with higher degree centrality are more central to the co-author network. Table 4 shows the top 10 authors by degree centrality in the CCF co-author network.


$$C_D(v) = \deg(v) \qquad (4)$$


Table 4. Top 10 authors with degree centrality in CCF co-author network.


Author alias | Degree centrality | Author alias | Degree centrality
28452 | 73 | 13772 | 85
20252 | 71 | 851 | 58
14929 | 55 | 30312 | 52
28719 | 59 | 15748 | 53
20222 | | 18169 | 47


5 Identifying Key Person and Group Team 


Building on the above discussion, we try to discover key persons and group teams. According to the centrality values, we regard an author who is in a central position of the CCF co-author network as a key person. Moreover, we identify the group team members, i.e., those who have a certain number of collaborations with the key person.


5.1 Identifying Key Person 


A key person is an author whose research work, carried out in coordination with other scholars, is high in quantity and quality, and who may currently be at the top of the field. From the above discussion, we know that an author with higher degree centrality is more central in the co-author network and accesses or obtains resources and information more efficiently than authors with lower degree centrality. Therefore, we define a key person as an author whose degree centrality is higher than 50 in the CCF co-author network. Table 5, which shows the key persons with their degree centrality and number of publications in the CCF co-author network, confirms that key persons not only have higher degree centrality but also a higher number of publications.


Table 5. Key persons with degree centrality and number of publications in CCF co-author network.


Author alias | Degree centrality | Number of publications
28452 | 73 | 85
13772 | 85 | 87
20252 | 71 | 51
851 | 58 | 47
14929 | 55 | 57
30312 | 52 | 42
28719 | 59 | 42
15748 | 53 | 56


5.2 Group Team Discovery 


Usually, a key person is supported by a group team or laboratory, and the key person is also the soul of the group team. Therefore, we identify group teams through collaboration with the key person. Firstly, a group team appears as a cluster in the co-author network; that is, a key person collaborates extensively with others in the group team and is at the center of the group team's co-author network. Secondly, authors who have fewer than 3 co-authored papers with the key person may represent collaborations across institutions rather than members of the key person's group team. Therefore, when selecting the group team members, we filter out those who have fewer than 3 co-authored papers with the corresponding key person. Figure 6 shows the eight key persons and their group team clusters in the co-author network. The larger nodes stand for the key persons; the other nodes represent authors who have collaborated with the key persons. From Fig. 6, a large proportion of collaboration evidently exists within group teams, but collaboration across group teams is rare. Based on this finding, authors could achieve fruitful results by collaborating more frequently with others across group teams.

Fig. 6. The overview of the co-author network in group teams
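
The two selection rules of this section (degree centrality higher than 50 for key persons, at least 3 co-authored papers for group team members) can be sketched as follows. The function names `key_persons` and `group_team` and the toy paper list are hypothetical; the degree values are those reported in Tables 4 and 5.

```python
from collections import Counter

def key_persons(degree, threshold=50):
    """Authors whose degree centrality is higher than the threshold."""
    return {author for author, d in degree.items() if d > threshold}

def group_team(papers, key_person, min_joint_papers=3):
    """Authors with at least min_joint_papers papers co-authored with key_person."""
    joint = Counter()
    for authors in papers:
        unique = set(authors)
        if key_person in unique:
            joint.update(unique - {key_person})
    return {author for author, count in joint.items() if count >= min_joint_papers}

# Degree centrality values as reported in Tables 4 and 5.
degree = {28452: 73, 13772: 85, 20252: 71, 851: 58, 14929: 55,
          30312: 52, 28719: 59, 15748: 53, 18169: 47}
selected = key_persons(degree)

# Toy paper list for a hypothetical key person "K".
papers = [["K", "a", "b"], ["K", "a"], ["K", "a", "c"], ["K", "b"], ["a", "c"]]
team = group_team(papers, "K")
print(len(selected), team)
```

In the toy data, author a shares three papers with K and is kept, while b (two papers) and c (one paper) are filtered out as likely cross-institution collaborators.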


6 Conclusions and Future Work 


In this paper, we describe the process of extracting the data set for the most recent 5 years from the CCF paper websites and building the co-author network of CCF. We analyze the CCF co-author network using social-centric and ego-centric social network analysis methods to understand the pattern of author collaboration and communication. In addition, we analyze the MKWs to confirm the research trends in the computer science field. Moreover, we identify the key persons in the CCF co-author network through the authors' degree centrality. Finally, based on collaboration with the key persons, we identify the eight group teams to which the eight key persons belong.

Our future research will take co-authorship order into account to understand how it influences the pattern of collaboration. Moreover, we will study the research interests of group teams in depth by analyzing the key words of group team members.


Abstract. Postpartum depression (PPD) is a depressive disorder with peripartum onset, which places a heavy burden on individuals and their families. In this paper, we propose to detect PPD among depressed people via their voices. We used openSMILE for feature extraction, selected the Sequential Floating Forward Selection (SFFS) algorithm for feature selection, tried different feature settings, used 5-fold cross-validation, and applied a Support Vector Machine (SVM) in Weka to train and test different models. The best predictive performance among our models is 69%, which suggests that speech features could be used as a potential behavioral indicator for identifying PPD within depression. We also found that the features and the content of the questions jointly contribute to the prediction. After dimension reduction, the average F-measure increased by 5.2%, and the precision for PPD rose to 75%. Compared with demographic questions, the features of emotional induction questions have better predictive effects.


Keywords: Postpartum depression · Depression · Speech features · Detecting · Classification


1 Introduction 


Postpartum depression (PPD) is a kind of depressive disorder with peripartum onset, which can affect both genders after childbirth, although females usually suffer more than males [1]. It is a heavy burden not only on the patients themselves but also on their spouses, children and whole families [2]. The concept of PPD was proposed by B. Pitt in 1968 [1], but there is still no agreement on its diagnostic criteria. PPD is one subtype of depressive disorder, and it is easily confused with other subtypes of depressive disorder in diagnosis [3]. Accurate diagnosis is critical for effective intervention. Given such inconsistent diagnosis, it is necessary to develop new methods to aid PPD diagnosis.

Depressive disorder has a visible influence on patients' emotions [4]. The influence of emotion on people can be reflected in their voices [5]. The study of Cannizzaro [6] showed that depressed people had less verbal production and variation than healthy people. Sobin and Sackeim [7] summarized some specific speech features of depression, including slow speech rate, increased pause duration, and so on. Since PPD is a subtype of depressive disorder, its influence on patients should be reflected in their voices as well.

Diagnosing PPD from voice is feasible. First of all, speech-based diagnosis does not disturb patients much. The procedure can be carried out during the interview, and patients need not complete any complex or time-consuming tasks. Secondly, the assessment is less conspicuous. Behavioral features can easily become unreliable, because people are able to control their behaviors [7]. Speech-based assessment avoids direct intervention with patients, which benefits the ecological validity of the data.

We propose to detect PPD among depressed patients by using speech features. The purpose of this study is to investigate how well PPD can be detected within depression via speech features in a natural experimental setting. The voices we used originated from conversations between patients and doctors recorded during interviews. Patients' voices were separated from these recordings and divided into two groups: PPD and non-PPD. Speech features were extracted with openSMILE; 988 features were extracted in all. We used all speech features, and the features remaining after dimension reduction, to predict PPD. The predictive effects of the speech features were evaluated with three indexes: precision, recall and F-measure.
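
The three evaluation indexes can be computed from the counts of a confusion matrix. The following Python sketch uses the standard definitions; the function name `precision_recall_f` and the example counts are hypothetical and are not results from this study.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (F1) for the positive class.

    tp: PPD samples predicted as PPD
    fp: non-PPD samples predicted as PPD
    fn: PPD samples predicted as non-PPD
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts for the PPD class.
precision, recall, f_measure = precision_recall_f(tp=30, fp=10, fn=20)
print(precision, recall, round(f_measure, 3))
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than the arithmetic average of the two.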


2 Related Work 


Voice is one means of emotional expression, and speech features have been found to be able to identify different emotions. Nwe et al. [8] reported that classifying voices into different emotions with an HMM (hidden Markov model) achieved a higher accuracy than human judgment (by 7.7% on average), with an average rate of up to 78%. Wu et al. [9] used prosodic and spectral features to identify seven emotions, with a best precision of 91.6%. These studies show that emotion can be predicted from speech features, which motivates us to identify mental illnesses such as PPD using speech features. There have been few studies of PPD patients' voices. We think that studies of phonetic changes in depression can probably be generalized to PPD.

The voices of depressed patients change significantly because of their illness [6]. The diagnostic speech features of depression in DSM-5 are described as slow speech, decreased volume, lessened variation in tone, and increased pause duration [4] (p. 163). Experiments have revealed specific vocal indicators of depression, such as slowed speech rate [10], increased pause duration and frequency [10, 11], shortened utterance duration [12], and longer speech initiation latency [13]. Speech features in depression also show changes in F0, such as decreased bandwidth, amplitude and energy [10, 14], a narrowed F0 range (ΔF0) [14], weakened intensity [15], and changes in the frequency spectrum such as a reduced second formant transition [12] and a reduced spectral energy distribution [16].
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Reviewing recent findings, Cohen and Elvevag [17] believed that computer-based 
assessments of natural language has the potential for measuring speech disturbances in 
people with severe mental illnesses. Some researchers attempted to predict depression 
via patients’ speech. Mundt et al. [11] stated that the regression model consisted of FO, 
pause duration, speech speed, speech duration, etc. could predict depression, and the 
explanatory power of model to depression reaching 79.2%. Emerging evidence sug- 
gested that speech features have a strong performance in predicting depression, which 
obtained a RMSE of 10.17 well below the baseline of 14.12 [18]. The study of Cohn 
et al. shown that the accuracy in detecting depression was 79% for vocal prosody [19]. 

In our study, we chose depressed voices collected under natural circumstances to
improve ecological validity, in contrast to the controlled experimental environments of
previous studies. In clinical diagnosis, distinguishing between different mental
illnesses is more crucial, and harder, than distinguishing healthy people from
psychiatric patients. To improve differential diagnosis among depressive spectrum
disorders, our aim is to detect PPD within depression.


3 Methods 


3.1 Participants 


In this study, patients’ voices were secondary data acquired from the CONVERGE
(China, Oxford and VCU Experimental Research on Genetic Epidemiology) project on
MDD, recorded during interviews. Our analyses were based on a total of 740
depressed patients recruited from 58 provincial mental health centers and psychiatric
departments of general medical hospitals in 45 cities of 23 provinces in China. All
patients were female. They were excluded if they had bipolar disorder, intellectual
deficiency or any type of psychosis. Patients were aged from 30 to 60, with a mean age
(standard deviation) of 44.4 (8.9). More details of this research, including
diagnosis and measures, are described in [20].


3.2 Data Acquisition 


Voices were collected with recording pens during computer-based interviews. All
interviewers were professional medical staff who had been trained by the CONVERGE team
for at least a week on how to carry out the interview. The interview (that is, the content
of the recordings) included assessment of demographics, family history, life events,
psychopathology (e.g. depression, anxiety, mania, psychosis, PPD) and psychosocial
functioning. The interview lasted two hours on average. The patients’ answers in the
interviews are the data we analyze.


3.3 Data Preprocessing 


The audio was initially recorded for auditing by trained editors, who provided feedback on the
quality of the interviews, so noise control had not been planned in advance.
Since we analyze only the patients’ voices, the data must be preprocessed before use.
Data preprocessing consisted of three steps before speech feature extraction (see Fig. 1).

Fig. 1. The procedure of data preprocessing


The first step was matching and exclusion. We first exported 11875 MP3
files from the CONVERGE database and matched the recordings with the results of the
psychological scales, keeping only those patients who had both a voice recording and
questionnaire results. After matching, a total of 4243 patients remained. We then
applied exclusion. To make sure that recordings of sufficient length were used
in the following steps, short recordings were excluded: based on our
experience, we excluded recordings shorter than half an hour. A total
of 3964 patients remained after this step.

The second step was screening. The recordings contained all kinds of noise, so
we selected high-quality recordings to limit the impact of noise on the predictive
results. We graded all recordings into levels according to fixed evaluation
criteria (see Table 1). In the end, 774 recordings were labelled level A and
were used for further processing.


Table 1. Evaluation criteria of voices in different levels

Level | Evaluation criteria
A | Background is quiet
B | Background has light noises
C | Noisy, hard to be cut
D | The voice of patient is not clear (a)

(a) “Clear” means the voices are distinguishable and the content could be easily
understood via hearing


The last and most important step was cutting and denoising. We
needed to separate the patients’ voices from the interviewers’ and remove other noises.
We recruited several workers for this task. The requirements
were: (1) each voice clip should be longer than 5 s; (2) the noises to be
cut include, but are not limited to, ringing, telephone rings, clicks and the voices of
other people. After denoising, 740 patients remained, with 21459 voice clips in total. Each
voice clip corresponds to one answer from one patient.


3.4 Feature Extraction, Selection and Data Analysis 


openSMILE [21] was used to extract speech features. A total of 988 speech features
were extracted, as follows. First, in view of
the related work above, 26 basic speech features were extracted from the recordings,
including intensity, loudness, zero-crossing rate, voicing probability, F0, F0 envelope,
eight line spectral pair frequencies (LSP) and twelve Mel-frequency cepstral coefficients
(MFCC). Second, to capture the variability of the voices, the first-order derivatives
of the 26 features were computed. Third, we calculated 19 statistics of all
these features, such as mean, standard deviation and range. In total, we
obtained 988 (= (26 + 26) × 19) speech features.
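The arithmetic behind the 988-dimensional vector can be sketched as follows. This is only an illustration, not the openSMILE implementation: the input array stands in for the 26 frame-level features of one clip, the derivative is a plain first-order difference, and the particular 19 statistics in `functionals` are an assumed selection (the paper names only mean, standard deviation and range).

```python
import numpy as np

def functionals(x):
    """Return 19 summary statistics of one contour (assumed selection)."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)           # linear trend of the contour
    resid = x - (slope * t + intercept)
    mean, std = x.mean(), x.std()
    z = (x - mean) / std if std > 0 else np.zeros_like(x)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return np.array([
        x.max(), x.min(), x.max() - x.min(),         # extremes and range
        x.argmax(), x.argmin(),                      # positions of the extremes
        mean, std,
        (z ** 3).mean(), (z ** 4).mean(),            # skewness and kurtosis
        slope, intercept,
        np.abs(resid).mean(), (resid ** 2).mean(),   # linear and quadratic fit error
        q1, q2, q3, q2 - q1, q3 - q2, q3 - q1,       # quartiles and inter-quartile ranges
    ])

def extract_features(lld):
    """lld: (n_frames, 26) float array of frame-level features for one voice clip."""
    delta = np.diff(lld, axis=0, prepend=lld[:1])    # simple first-order difference
    contours = np.hstack([lld, delta])               # 26 + 26 = 52 contours
    return np.concatenate(
        [functionals(contours[:, i]) for i in range(contours.shape[1])]
    )

# Synthetic stand-in for one clip: 200 frames of 26 features.
rng = np.random.default_rng(0)
features = extract_features(rng.normal(size=(200, 26)))
print(features.shape)  # (988,)
```

Each of the 52 contours contributes 19 values, which gives the 52 × 19 = 988 features per clip.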

Irrelevant and redundant features are expected to weaken
prediction performance, so it is not advisable to feed all speech features into the
classifier; features should be selected first. To choose relevant
features and reduce the dimension of the speech features, we used the Sequential
Floating Forward Selection (SFFS) algorithm.
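A minimal sketch of the SFFS idea is given below, assuming a user-supplied `score` function that rates a feature subset (for example, cross-validated F-measure). The function names and the toy scoring function are illustrative assumptions, not the configuration used in this study.

```python
def sffs(n_features, score, k):
    """Sequential Floating Forward Selection.

    n_features: number of candidate features (indexed 0..n_features-1)
    score:      callable mapping a list of feature indices to a number (higher is better)
    k:          size of the subset to return
    """
    selected = []
    best = {}  # subset size -> (best score seen, subset)
    while len(selected) < k:
        # Forward step: add the feature that improves the score most.
        candidates = [f for f in range(n_features) if f not in selected]
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [f_best]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the best subset of the smaller size.
        while len(selected) > 2:
            f_worst = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_worst]
            s_reduced = score(reduced)
            if s_reduced > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_reduced, list(reduced))
            else:
                break
    return best[k][1]

# Toy criterion (hypothetical): each feature has a fixed, additive usefulness.
weights = [0.1, 0.9, 0.5, 0.7]
toy_score = lambda subset: sum(weights[f] for f in subset)
print(sorted(sffs(4, toy_score, 2)))  # [1, 3]
```

With an additive criterion the floating step never fires. It matters when features are redundant, because a feature added early can become dispensable once better ones have joined the subset.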

Data analysis mainly comprised classification and correlation analysis. Patients were
divided into two groups according to whether they had been diagnosed with PPD.
The group labels served as the gold standard in classification: patients with
PPD were labeled 1, and those without PPD were labeled 0. For classification, we used
SVM with 5-fold cross validation to train and test the different models. To determine
whether the independent variables “number of
features” and “sample size” are significantly related to predictive performance, we used
partial correlation analyses. In addition, paired-sample t-tests were used to examine the
impact of dimension reduction and of question content on predictive performance.
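The three evaluation indexes and the 5-fold partition can be sketched in plain Python as below. The classifier itself is omitted, and the interleaved fold assignment is an assumption made for illustration, not a description of how the folds were actually drawn.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Precision, recall and F-measure for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

def five_fold_indices(n, k=5):
    """Split range(n) into k test folds so that every sample is tested exactly once."""
    return [list(range(i, n, k)) for i in range(k)]

# One PPD case found, one missed, no false alarms.
p, r, f = precision_recall_f([1, 1, 0, 0], [1, 0, 0, 0])
print(p, r, round(f, 3))  # 1.0 0.5 0.667
```

In each round, one fold is held out for testing and the remaining four are used for training. The indexes are then computed on the held-out predictions.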


4 Results 


4.1 Prediction 


The ratio of the PPD group to the non-PPD group was kept at 1:1 so that group size
would have no obvious impact on the predictive results. To observe the effect of
dimension reduction directly, we predicted PPD both with all speech features and with
the features retained after dimension reduction. The results are shown in Tables 2 and 3,
respectively. We list the ten best-performing questions, ordered by sample size
from small to large.
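How the 1:1 ratio was obtained is not described here. One common way, shown purely as an assumption, is random undersampling of the larger group:

```python
import random

def balance_groups(labels, seed=0):
    """Return sorted sample indices with equal numbers of label 1 (PPD) and label 0 (non-PPD)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    # Keep n randomly chosen samples from each group.
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))

labels = [1] * 3 + [0] * 7
kept = balance_groups(labels)
print(len(kept), sum(labels[i] for i in kept))  # 6 3
```

With balanced groups, a classifier that guesses at random scores about 50% on all three indexes, which serves as the reference level for the results.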

In Table 2, classification was based on all 988 speech features. The best
F-measure is 65%, obtained on the replies to one question of the depression scale.
In rows 4-9, different questions with the same sample size show
different predictive power, even though the same speech features were used for all ten
questions. In addition, predictive performance on demographic answers was
generally lower than on the other questions.

In Table 3, the number of features was dramatically reduced by dimension
reduction. The best result with the selected features, an F-measure of 69%, is for PSY.3,
the reply to one question of the psychosis scale. The average F-measure increased
by 5.2 percentage points (from 0.557 to 0.609). In rows 4-9, the differences in predictive
performance between different questions with the same sample size decreased.

Table 2. Results of classification using 988 features


Sample size | Question | Precision | Recall | F-measure
80 | DEP.E24.F | | | 0.65
90 | PSY.4 | | | 0.57
120 | PSY.3 | | | 0.61
120 | DEP.E29 | | | 0.57
130 | GAD.D64.D | | | 0.61
130 | DEP.E26 | | | 0.53
170 | D2.B | | | 0.46
170 | D4.A | | | 0.49
190 | D6.A | | | 0.55
220 | D6 | | | 0.53
Average | | | | 0.557

In the second column, the abbreviation before the first point denotes the
questionnaire, and the content after the first point denotes the item
(question) number in that questionnaire (except for demographics, where
the content after the letter “D” is the item number). DEP, depression scale;
PSY, psychosis scale; GAD, scale of general anxiety disorder; D, demographics.


Table 3. Results of classification after dimension reduction 


Sample size | Number of features | Question | F-measure
80 | | DEP.E24.F | 0.62
90 | 35 | PSY.4 | 0.59
120 | 15 | PSY.3 | 0.69
120 | 18 | DEP.E29 | 0.64
130 | 28 | GAD.D64.D | 0.62
130 | 12 | DEP.E26 | 0.62
170 | 13 | D2.B | 0.53
170 | 30 | D4.A | 0.53
190 | 9 | D6.A | 0.63
220 | 12 | D6 | 0.62
Average | | | 0.609

In the third column, the abbreviation before the first point denotes the
questionnaire, and the content after the first point denotes the item (question)
number in that questionnaire (except for demographics, where the content after
the letter “D” is the item number). DEP, depression scale; PSY, psychosis scale;
GAD, scale of general anxiety disorder; D, demographics.


The confusion matrices for the PPD and non-PPD groups are shown in Table 4. The precision
for PPD improved markedly after dimension reduction, reaching 75%. In contrast, the
precision for non-PPD decreased slightly after dimension reduction.

Table 4. Confusion matrixes of the most effective questions: (a) 988 features (0, non-PPD;
1, PPD); (b) 15 features (0, non-PPD; 1, PPD)


4.2 Correlation Analysis and Significance Test 


Are the independent variables, sample size and number of features, significantly related
to predictive performance? To find out, we applied partial correlation analyses,
each controlling for the other factor. After controlling for sample size, there was no
significant correlation between the number of features and the predictive indexes
(precision: r = -0.461, p > .1; recall:
r = -0.422, p > .1; F-measure: r = -0.473, p > .1). The correlation analysis between
sample size and the predictive indexes (see Fig. 2) showed that, after controlling for
the number of features, sample size had a significant
negative correlation with the three indexes obtained with 988 features
(precision: r = -0.697, p < .05; recall: r = -0.691,
p < .05; F-measure: r = -0.691, p < .05). However, there was no significant correlation
between sample size and the three indexes obtained with the reduced features.
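For a single control variable, a partial correlation can be computed from the three pairwise Pearson correlations. The sketch below uses NumPy and made-up toy vectors; it illustrates the statistic only and does not reproduce the analysis above.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r = np.corrcoef([x, y, z])
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (hypothetical): z is uncorrelated with x and y, so the partial
# correlation equals the plain correlation between x and y.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
z = [1, -1, -1, 1]
print(round(partial_corr(x, y, z), 3))  # 0.8
```

When the control variable is correlated with both variables, the correction term `r_xz * r_yz` removes the part of their association that runs through it.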


Fig. 2. The impact of sample size on predictive indexes (dr, dimension reduction). Curves:
precision, recall and F-measure, before and after dimension reduction, plotted against
sample size.

To determine whether dimension reduction significantly improves predictive
performance, we compared the results obtained with all features against those obtained
with the reduced features using a paired-sample t test (Table 5 lists the means and
standard deviations of the predictive indexes). The results indicated that predictive
performance improved significantly after dimension reduction (precision: t = -4.763,
p < .01; recall: t = -4.31, p < .01; F-measure: t = -4.061, p < .01). To further examine
the impact of the question, the three pairs of questions with the same sample size were
compared with a paired-sample t test. The results showed that the differences between
different questions with the same sample size shrank markedly after dimension reduction
(sample size 120: t = -1, p = .42; sample size 170: t = 7, p < .05). For sample size 130,
the t value could not be computed because the SD was zero, but this pair showed the
largest change in difference after dimension reduction.
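The paired-sample t statistic, including the zero-SD case mentioned above, can be sketched as follows. The function name and the toy numbers are illustrative only.

```python
import math

def paired_t(a, b):
    """Paired-sample t statistic for two equally long sequences.

    Returns None when the paired differences have zero standard deviation,
    because the statistic is then undefined.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of the differences
    if var == 0:
        return None
    return mean / math.sqrt(var / n)

print(round(paired_t([1, 2, 3], [2, 4, 3]), 3))  # -1.732
print(paired_t([1, 2], [2, 3]))                  # None
```

The statistic is compared against a t distribution with n - 1 degrees of freedom. A negative value means the first condition scored lower than the second on average.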


Table 5. The means and standard deviations of predictive indexes 











 | 988 features | | | Reduced features | |
 | Precision | Recall | F-measure | Precision | Recall | F-measure
Mean | 55.90 | 55.89 | 55.79 | 61.50 | 61.10 | 60.90
SD | 5.92 | 5.88 | 5.77 | 4.77 | 4.64 |


SD, standard deviation 









5 Discussion 


The purpose of this study was to detect PPD in depressed patients via speech features. We
used openSMILE for feature extraction and SFFS for feature selection, tried
different numbers of features, used 5-fold cross validation to strengthen the
generalizability of the models, and applied SVM in Weka to train and test them.
The best F-measure reached 69%, compared with 50% for random
prediction, which suggests that voice could serve as a potential behavioral
indicator for identifying subtypes of depressive disorder.

We speculated that the number of features, sample
size and question content might each have an important influence on predictive
performance. Our results indicated that the
number of features has no significant relationship with prediction, yet predictive
performance improves dramatically after dimension reduction. Because the number of
features retained after dimension reduction differs between questions, we think the
positive impact of dimension reduction is a combined result of the
number of features and the content of the question. Unexpectedly, there was a negative
correlation between sample size and predictive performance. The probable cause is that
demographic questions, which have the largest samples, lack the ability to induce
emotion, which results in indistinguishable neutral emotion in all patients’ voices.
As shown in Fig. 2, questions D2.B and D4.A contribute markedly to the obvious dips in
the curves.

Different questions have different predictive effects. The
most effective questions in Tables 2 and 3 offer some clues. Before dimension reduction,
the most effective question is DEP.E24.F, which asked patients to recall their state
during their most severe depressive episode. After dimension
reduction, the most effective question is PSY.3, which asks, “Have you ever taken
medicine for your nerves or the way you were feeling or acting?”. All patients suffer
from recurrent depression, so they
must have experience of taking antidepressants. It is therefore a good question for
inducing emotion, because experiences with psychotropic medicine were probably negative
owing to side effects. In summary, how strongly a question induces emotion
has an important influence on predictive performance.


6 Conclusion 


In this study, the best predictive performance of our speech-based models was an
F-measure of 69%, which suggests that speech features could serve as a potential
behavioral indicator for identifying PPD in depressed patients. The features and the
question jointly contribute to the improvement in predictive performance. After
dimension reduction, the average F-measure increased by 5.2 percentage points, and the
precision for PPD rose to 75%. Compared with neutral demographic questions,
emotion-inducing questions yield features with better predictive performance.


Acknowledgments. This work was supported by the National Basic Research Program of China 
(973 Program) (No. 2014CB744603), and Natural Science Foundation of Hubei Province 
(2016CFB208). 


References 


1. Pitt, B.: ‘Atypical’ depression following childbirth. Br. J. Psychiatry 114(516), 1325-1335 
(1968) 

2. Burke, L.: The impact of maternal depression on familial relationships. Int. Rev. Psychiatry 
15(3), 243-255 (2003) 

3. American College of Obstetricians and Gynecologists. Committee on Obstetric Practice, 
Committee opinion no. 453: Screening for depression during and after pregnancy. Obstet. 
Gynecol. 115(2 Pt 1), 394-395 (2010) 

4. American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders
(DSM-5®). American Psychiatric Publishing (2013)

5. Kramer, E.: Elimination of verbal cues in judgments of emotion from voice. J. Abnorm. Soc. 
Psychol. 68(4), 390-396 (1964) 

6. Cannizzaro, M., Harel, B., Reilly, N., Chappell, P., Snyder, P.J.: Voice acoustical 
measurement of the severity of major depression. Brain Cognit. 56(1), 30-35 (2004) 

7. Sobin, C., Sackeim, H.A.: Psychomotor symptoms of depression. Am. J. Psychiatry 154(1),
4-17 (1997)

8. Nwe, T.L., Foo, S.W., De Silva, L.C.: Speech emotion recognition using hidden Markov
models. Speech Commun. 41(4), 603-623 (2003)

9. Wu, S., Falk, T.H., Chan, W.-Y.: Automatic speech emotion recognition using modulation
spectral features. Speech Commun. 53(5), 768-785 (2011)

10. Ellgring, H., Scherer, P.K.R.: Vocal indicators of mood change in depression. J. Nonverbal
Behav. 20(2), 83-110 (1996)

11. Mundt, J.C., Vogel, A.P., Feltner, D.E., Lenderking, W.R.: Vocal acoustic biomarkers of 
depression severity and treatment response. Biol. Psychiatry 72(7), 580-587 (2012) 


12. Flint, A.J., Black, S.E., Campbell-Taylor, I., Gailey, G.F., Levinton, C.: Abnormal speech
articulation, psychomotor retardation, and subcortical dysfunction in major depression.
J. Psychiatr. Res. 27(3), 309-319 (1993)

13. Mandal, M.K., Srivastava, P., Singh, S.K.: Paralinguistic characteristics of speech in
schizophrenics and depressives. J. Psychiatr. Res. 24(2), 191-196 (1990)

14. Porritt, L.L., Zinser, M.C., Bachorowski, J.-A., Kaplan, P.S.: Depression diagnoses and
fundamental frequency-based acoustic cues in maternal infant-directed speech. Lang. Learn.
Dev. 10(1), 51-67 (2014)

15. Cohen, A.S., Kim, Y., Najolia, G.M.: Psychiatric symptom versus neurocognitive correlates
of diminished expressivity in schizophrenia and mood disorders. Schizophr. Res. 146(1-3),
249-253 (2013)

16. Tolkmitt, F., Helfrich, H., Standke, R., Scherer, K.R.: Vocal indicators of psychiatric
treatment effects in depressives and schizophrenics. J. Commun. Disord. 15(3), 209-222
(1982)

17. Cohen, A.S., Elvevag, B.: Automated computerized analysis of speech in psychiatric
disorders. Curr. Opin. Psychiatry 27(3), 203-209 (2014)

18. Cummins, N., Joshi, J., Dhall, A., Sethu, V., Goecke, R., Epps, J.: Diagnosis of depression
by behavioural signals: a multimodal approach. In: Proceedings of the 3rd ACM
International Workshop on Audio/Visual Emotion Challenge, New York, NY, USA,
pp. 11-20 (2013)

19. Cohn, J.F., et al.: Detecting depression from facial actions and vocal prosody. In: 2009 3rd
International Conference on Affective Computing and Intelligent Interaction and
Workshops, pp. 1-7 (2009)

20. Yang, F., et al.: Clinical features of and risk factors for major depression with history of
postpartum episodes in Han Chinese women: a retrospective study. J. Affect. Disord. 183,
339-346 (2015)

21. Eyben, F., Weninger, F., Gross, F., Schuller, B.: Recent developments in openSMILE, the
Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM
International Conference on Multimedia, New York, NY, USA, pp. 835-838 (2013)


Research on Network Public Opinion 
Propagation Mechanism Based 
on Sina Micro-blog 


Weidong Huang, Qian Wang, and Yixuan Wang


School of Management, Nanjing University of Posts and Telecommunications,
Nanjing, Jiangsu, China
huangwd@njupt.edu.cn


Abstract. In the era of micro-blogs, network public opinion has become the
main expression of public opinion. It can quickly form a
network and promote fission-like dissemination of information,
with strong interactivity and effectiveness. Taking the case of “A girl suffered
attacks in a Yitel”, we construct the network propagation diagram with UCINET software
and study the structural characteristics of the public opinion
propagation network on Sina micro-blog, measuring both the overall structure of the
propagation network and its key nodes. The results show that the
key nodes in the public opinion propagation network have a high ability to spread
information, so the velocity of micro-blog public opinion can be controlled by
influencing these key nodes.


Keywords: Micro-blog · Social network analysis · Key nodes · Network structure


1 Introduction 


Because of the clustering characteristics of network aggregation, the propagation of
network public opinion is often sudden and explosive. With the rise of social networks,
social media has become a great impetus to the formation of public opinion. By the end
of 2015, Sina micro-blog had reached 236 million monthly active users, an
increase of 34%. As one of the fastest-developing new media of recent years,
Sina micro-blog has become the main carrier for the aggregation and outbreak of network
public opinion owing to its dual nature as both a social and a media platform.

As a medium of propagation, micro-blog makes each user both a publisher and a
communicator of information. After netizens release information, it only
needs to be forwarded by their large groups of fans, and then forwarded
again by those fans’ own followers, to reach an immeasurable readership and
influence. Following on micro-blog is based on similar preferences and habits, which
makes the superposition and resonance effects of micro-blog propagation significant.
Studying the propagation mechanism of network public opinion through public opinion
events on micro-blog is therefore of great practical significance for new media,
and provides an important reference for monitoring
and guiding network public opinion.


2 Research and Design of the Propagation Mechanism 
of Network Public Opinion Based on Sina Micro-blog 


Information on Sina micro-blog spreads mainly in the following ways: the release
function, the @ function, the forwarding function and the comment function [1]. These
four ways release information quickly and effectively, so that it can spread on
Sina micro-blog and reach more users. The propagation of
network public opinion on micro-blog is a very complex process rather than a single,
linear propagation mode. Starting from the sub-public communication formed by
the users’ follow function, it develops into mass communication as first-level
information receivers forward the information for
secondary spread, and then it turns into a multi-level propagation mode together with
traditional mass media platforms. In this paper, we selected “A girl suffered attacks in a
Yitel” as a case, collected data and built a propagation analysis
framework for micro-blog public opinion events to study the evolutionary mechanism of
network public opinion.


2.1 Event Description of “A Girl Suffered Attacks in a Yitel” 


On April 5, 2016, the victim, whose micro-blog account name is Wanwan, released a video
of the attack in the hotel on her Sina micro-blog. Her unnerving story soon
triggered nationwide rage online. Through micro-blog’s comment function, some
users hoped that the victim would release details of how the incident unfolded, some
asked whether the hotel would give a statement and how the matter was handled after the
police were called, and others forwarded the post so that more people would see this
vicious incident, as a warning to female friends. Because it spread so fast and on such a
large scale, the topic #A girl suffered attacks in a Yitel# quickly entered the hot topic
list of Sina micro-blog. During this period, @Reporter_hong-tao Xue forwarded
Wanwan’s micro-blog to offer consultation on the issue and popularize relevant
legal knowledge. @Yitel and @Homeinns Co., Ltd issued statements through
micro-blog, saying that they would investigate this vicious incident and give an
account to the victim and the public. @Safety Beijing, the official micro-blog of the
Beijing Municipal Public Security Bureau, has been tracking this matter.


2.2 Data Collection and Processing 


We set the collection and retrieval window from April 5 to April 6, 2016 and
acquired Sina micro-blog source data. Using the hot topic search, we collected Sina
micro-blogs and their forwarding data, a total of 1900 micro-blogs. We
sorted out the forwarding relationships according to Sina micro-blog’s forwarding
rules and identified more than 500 users involved in the topic
“A girl suffered attacks in a Yitel”. We processed the raw data, calculated the
propagation relationships between nodes, and constructed the propagation network
matrix of micro-blog public opinion for this event. This matrix is a
non-symmetric 502 × 502 matrix whose entries indicate how often the nodes
appear in the sample data. Based on the matrix, we used UCINET software to
visualize the propagation network of this event (Fig. 1).
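The construction of a non-symmetric propagation matrix from forwarding records can be sketched as below. The input format, a list of (source user, forwarding user) pairs, and the function name are assumptions made for illustration. The actual data export and the UCINET import step are not shown.

```python
def build_matrix(forwards):
    """Build a directed, weighted adjacency matrix from forwarding records.

    forwards: list of (source_user, forwarding_user) pairs (hypothetical input format).
    Returns the ordered user list and a square matrix in which entry [i][j]
    counts how often user j forwarded user i.
    """
    users = sorted({u for pair in forwards for u in pair})
    index = {u: i for i, u in enumerate(users)}
    matrix = [[0] * len(users) for _ in users]
    for src, dst in forwards:
        matrix[index[src]][index[dst]] += 1
    return users, matrix

users, matrix = build_matrix([("a", "b"), ("a", "b"), ("b", "c")])
print(users)   # ['a', 'b', 'c']
print(matrix)  # [[0, 2, 0], [0, 0, 1], [0, 0, 0]]
```

The matrix is non-symmetric because forwarding has a direction: an entry from a to b says nothing about b to a.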








Fig. 1. Network propagation diagram of “A girl suffered attacks in a Yitel” (Numbered Edition) 


2.3 The Design of Analytic Framework 


To analyze the relationships between the propagation nodes of micro-blog public opinion,
we need to measure the structure of the propagation network formed by the nodes. Social
network analysis not only measures the relationships between nodes,
but also lets researchers grasp node behavior more intuitively by
visualizing the interactions between nodes.

For the micro-blog public opinion event “A girl suffered attacks in a Yitel”, we
designed an analytic framework consisting mainly of overall network structure
measurement and key node measurement. Among the parameters that measure the overall
network structure, network density measures the compactness of the network
by comparing the ties actually present with those of
the complete graph, and clustering coefficients reflect the structure of the
propagation network and the degree of aggregation between nodes.
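The two overall-structure measures can be illustrated on a small adjacency matrix. This is a sketch under simplifying assumptions: density counts directed ties against the N(N - 1) possible ones, and the local clustering coefficient ignores tie direction and weight. UCINET's exact variants may differ.

```python
from itertools import combinations

def density(matrix):
    """Share of possible directed ties (self-ties excluded) that are present."""
    n = len(matrix)
    ties = sum(1 for i in range(n) for j in range(n) if i != j and matrix[i][j])
    return ties / (n * (n - 1)) if n > 1 else 0.0

def local_clustering(matrix, v):
    """Share of pairs of v's neighbors that are themselves linked (direction ignored)."""
    n = len(matrix)

    def linked(i, j):
        return bool(matrix[i][j] or matrix[j][i])

    nbrs = [u for u in range(n) if u != v and linked(u, v)]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    return sum(linked(a, b) for a, b in pairs) / len(pairs)

m = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
print(density(m))              # 0.5
print(local_clustering(m, 0))  # 1.0
```

A density near 0 indicates a sparse, loosely connected propagation network, while a value of 1 corresponds to the complete graph.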

Among the parameters for measuring key nodes, the structural hole index
indicates a node’s ability to control the information received by other nodes, through
the effective size, global constraint and grade. Intermediary (betweenness) centrality
is the frequency with which the shortest paths between other nodes pass through
a given node, and shows the ability of the node to act as a medium. Intermediary
centrality reflects how much information flow depends on a node, whereas closeness
centrality measures the extent to which a point in the graph is not subject to control
by others (Fig. 2).


3 Experimental Measurement and Analysis


3.1 Analysis of Key Nodes 


In the propagation of micro-blog public opinion for this event, @Wanwan_2016, as the
victim, released the first micro-blog about being attacked in a Yitel, and it soon swept
across the platform. The official micro-blog of Yitel released a statement
on it. At the same time, @Safety Beijing also reported on the incident, and was the
first government micro-blog to do so. In addition, @Reporter_hong-tao
Xue played an important role in the propagation and diffusion of
public opinion. A key nodes analysis algorithm was used to produce the visualization
in Fig. 3, in which these nodes stand out clearly.

Fig. 3. Key nodes diagram of “A girl suffered attacks in a Yitel”


3.1.1 Analysis of Structural Holes
When a structural hole exists between two contacts of a node, the contributions those
contacts make to the node’s network are additive rather than redundant. Measuring this
allows the degree of structural holes in the propagation network to be analyzed [2].
Using UCINET software, we analyzed the structural holes of some nodes in the micro-blog
public opinion propagation network of “A girl suffered attacks in a Yitel” and sorted
them by effective scale to obtain Table 1. We found that No. 1 “Wanwan_2016”, No. 2
“Ctrip”, No. 3 “Safety Beijing”, No. 5 “Homeinns Co., Ltd” and No. 4 “Yitel” ranked in the
top five. The effective scale reflects the position of a node in the propagation
network: the larger the value, the more central the position [2]. In
addition, the limiting degree of these five nodes is relatively
small, below 0.2, which means that the five nodes are not easily
controlled by other nodes and can more easily access and spread public opinion.
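To give a feel for the effective scale, the sketch below uses Borgatti's simplification of Burt's effective size for binary, undirected ego networks: n - 2t/n, where n is the number of a node's contacts and t is the number of ties among those contacts. This is an assumption for illustration. The network here is directed and weighted and was analyzed in UCINET, so the published values are not reproduced by this formula.

```python
def effective_size(neighbors, ego):
    """Effective size of ego's network: n - 2t/n (binary, undirected simplification).

    neighbors: dict mapping each node to the set of nodes it is tied to.
    """
    alters = neighbors[ego]
    n = len(alters)
    if n == 0:
        return 0.0
    # Count each tie among ego's contacts once.
    ties = sum(1 for a in alters for b in alters if a < b and b in neighbors[a])
    return n - 2 * ties / n

# Ego "e" bridges three contacts that are not connected to each other.
star = {"e": {"a", "b", "c"}, "a": {"e"}, "b": {"e"}, "c": {"e"}}
# Ego "e" has two contacts who also know each other (redundant contacts).
triangle = {"e": {"a", "b"}, "a": {"e", "b"}, "b": {"e", "a"}}
print(effective_size(star, "e"))      # 3.0
print(effective_size(triangle, "e"))  # 1.0
```

Every tie among a node's contacts makes those contacts partly redundant and lowers the effective size, which is why hub accounts whose forwarders do not forward each other score highest.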


Table 1. Analysis of propagation network structural holes of “A girl suffered attacks in a Yitel”
(part)

No. | Node | Effective scale | Limiting degree | Grade degree
1 | Wanwan_2016 | 152.078 | | 0.904
2 | Ctrip | 76.185 | | 0.795
3 | Safety Beijing | 64.389 | | 0.810
5 | Homeinns Co., Ltd | 60.809 | | 0.795
4 | Yitel | 56.717 | | 0.787
6 | Reporter_hong-tao Xue | 19.286 | | 0.451
10 | Police Wang | 9.090 | | 0.195
7 | People outside the city | 8.825 | | 0.273
27 | Update step by step | 5.583 | | 0.065

3.1.2 Intermediary Centrality 

Intermediary centrality measures the ability of a node to act as a medium of 
propagation, that is, as a “bridge” in the process of information propagation [3]. The 
more often a node lies on the shortest paths between pairs of other nodes, the higher 
its intermediary centrality, because more nodes have to spread information through it; 
such a node is also known as a “bridge node”. Let $g_{jk}$ be the number of shortest 
paths from node j to node k, $g_{jk}(x_i)$ the number of those shortest paths that 
contain node i (with i, j and k distinct), and N the number of nodes in the network. In 
a directed network, the intermediary centrality is calculated as shown in Eq. (1). The 
results of the intermediary centrality measurements of the propagation network are 
shown in Table 2. 


$$C_B(x_i) = \frac{\sum_{j}\sum_{k} g_{jk}(x_i)/g_{jk}}{(N-1)(N-2)} \qquad (1)$$


The six nodes “Wanwan_2016”, “Ctrip”, “Safety Beijing”, “Homeinns Co., Ltd”, “Yitel” 
and “Reporter_hong-taoXue” have the six highest intermediary centrality values in the 
data sample, which means that they play important roles as media in the propagation of 
the micro-blog public opinion. They not only hold a greater share of the public opinion 
information but also have a greater impact on other nodes. The measurement results show 
that the party disclosed the vicious incident and triggered a large number of 
micro-blog users to forward the information, which led to its propagation. In addition, 
the government micro-blog, the official micro-blogs of the media and other celebrities 
have high intermediary centrality, while that of the other nodes is generally low, 
which shows that different types of nodes have different effects on the propagation of 
public opinion information. 
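
As an illustration of Eq. (1), the following minimal Python sketch computes the relative intermediary (betweenness) centrality of a small directed graph. It assumes an unweighted network stored as an adjacency dictionary in which every node appears as a key, and it uses Brandes' accumulation scheme, which is not necessarily how the values in Table 2 were produced. The function name `relative_betweenness` and the toy graphs are hypothetical.

```python
from collections import deque


def relative_betweenness(adj):
    """Betweenness of every node, divided by (N - 1)(N - 2) as in Eq. (1).

    adj maps each node to a list of its successors (directed, unweighted).
    """
    nodes = list(adj)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(nodes)
    norm = (n - 1) * (n - 2)
    return {v: score[v] / norm for v in nodes}


# Hypothetical toy graphs: a directed chain and a two-way star.
chain = {"a": ["b"], "b": ["c"], "c": []}
star = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}
chain_scores = relative_betweenness(chain)
star_scores = relative_betweenness(star)
```

In the chain, node b lies on one of the two ordered pairs (a, c) and (c, a), so its relative value is 0.5; the centre of the star lies on every leaf-to-leaf path and reaches the maximum of 1.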


Table 2. Analysis of intermediary centrality of propagation network of “A girl suffered attacks 
in a Yitel” (part) 

No. | User type             | Node                 | Absolute intermediary centrality | Relative intermediary centrality 
1   | Party                 | Wanwan_2016          | 143458.422                       | 57.498 
2   | Companies involved    | Ctrip                | 67290.070                        | 26.970 
3   | Government micro-blog | Safety Beijing       | 64621.414                        | 25.900 
5   | Companies involved    | Homeinns Co., Ltd    | 51395.484                        | 20.599 
4   | Companies involved    | Yitel                | 50695.699                        | 20.319 
6   | Other celebrities     | Reporter_hong-taoXue | 47793.961                        | 19.156 
27  | Domestic consumer     | Update step by step  | 7226.098                         | 2.896 
23  | Domestic consumer     | A glimpse of autumn  | 7142.234                         | 2.863 


3.1.3 Closeness Centrality 

In the information propagation process of Sina micro-blog, a user may be connected, 
directly or indirectly, with many people in the network. Such a user has a relatively 
high closeness centrality, that is, the user is close to a large number of other users 
in the network. As Table 3 shows, Reporter_hong-taoXue has the highest closeness 
centrality in the propagation network of this micro-blog public opinion event, which 
means that this node is connected with many people in the network. The attributes of 
the node also show that opinion leaders have a stronger influence on the development of 
public opinion events. 
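
The text does not state how closeness is computed, so the sketch below is only an assumption: it uses the common normalisation 100 · (N − 1) / farness, where farness is the sum of shortest-path lengths from a node to the nodes it can reach. With N = 501, a value that is not given in the paper and is chosen here only because it reproduces the tabulated figures, the formula matches the out-closeness values of Table 3. The function names are hypothetical.

```python
from collections import deque


def out_farness(adj, source):
    """Sum of shortest-path lengths from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values())


def relative_closeness(farness, n):
    """Normalised closeness in percent: 100 * (n - 1) / farness."""
    return 100.0 * (n - 1) / farness


# Farness on a hypothetical three-node chain a -> b -> c.
chain = {"a": ["b"], "b": ["c"], "c": []}
chain_farness = out_farness(chain, "a")  # distances 1 and 2

# Consistency check against two rows of Table 3, assuming N = 501.
reporter = round(relative_closeness(974, 501), 3)
ctrip = round(relative_closeness(1250, 501), 3)
print(reporter, ctrip)  # 51.335 40.0
```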

We analyze the key nodes of the propagation network of the micro-blog public opinion 
event “A girl suffered attacks in a Yitel” from the point of view of structural holes, 
intermediary centrality and closeness centrality. According to the calculated data, the 
effective scales of “Wanwan_2016” (the party node) and of “Homeinns Co., Ltd” and 
“Yitel” (the nodes of the companies involved) are 152.078, 60.809 and 56.717, 
respectively, and their limiting degrees are less than 0.2. These three nodes are the 
starting nodes of the micro-blog information in this event, so external nodes do not 
affect them. 


Table 3. Analysis of closeness centrality of propagation network of “A girl suffered attacks in a 
Yitel” (part) 

No. | Node                 | In farness | Out farness | Out closeness 
6   | Reporter_hong-taoXue | 973.000    | 974.000     | 51.335 
1   | Wanwan_2016          | 1027.000   | 991.000     | 50.454 
3   | Safety Beijing       | 1105.000   | 1297.000    | 38.551 
28  | Legal Evening News   | 1203.000   | 1210.000    | 41.322 
2   | Ctrip                | 1250.000   | 1250.000    | 40.000 
5   | Homeinns Co., Ltd    | 1268.000   | 1269.000    | 39.401 
305 | Happy xiaoliumang    | 1279.000   | 1288.000    | 38.820 
23  | A glimpse of autumn  | 1290.000   | 1258.000    | 39.746 


Moreover, the relative intermediary centralities of “Safety Beijing” and 
“Reporter_hong-taoXue” are 25.900 and 19.156, respectively, and both nodes have 
relatively high closeness centrality. It can therefore be concluded that the two nodes 
are “bridge” nodes in the network propagation of this micro-blog public opinion event: 
they act as “opinion leaders”, play leading roles in the propagation of public opinion 
information in this event, and promote the transfer of that information through their 
own influence. 


3.2 Overall Network Structure Measurement 


3.2.1 Network Density and Distance Between Nodes 

Network density is directly correlated with the strength of the relationships between 
nodes, and its value decreases as the number of nodes increases [4, 5]. The distance 
between nodes in the propagation network is mainly used to measure the structural 
characteristics of the network. The shorter the distance between nodes, the shorter the 
path over which a connection between them can be established, the closer the 
relationship between the nodes, the stronger the cohesion of the propagation network, 
and the faster information can spread in it. 


Table 4. Measurement results of network density and distance between nodes of “A girl 
suffered attacks in a Yitel” 

Density | Standard deviation | Average distance | Distance-based cohesion | Distance-weighted fragmentation 
0.0102  | 0.1018             | 3.193            | 0.344                   | 0.656 


According to the statistical results in Table 4, the network density of this public 
opinion event is only 0.0102. The density is therefore very small in the propagation 
process of the public opinion event: the links between nodes are dispersed, and the 
exchange of information is infrequent. The average distance between nodes of the 
propagation network is 3.193, the distance-based cohesion is 0.344, and the 
distance-weighted fragmentation is 0.656. These results show that the propagation 
ability of the public opinion information in the data sample is moderate and that 
cohesion is not strong. The propagation probabilities for distances 1 to 4 are 1%, 
26.1%, 25.4% and 47.4%, respectively, so the propagation ability of sub-public 
propagation is moderate; however, nearly half of the relationship links have distance 
4, which shows that mass communication has a huge effect in this event. 
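
The paper reports the measures of Table 4 without defining them. The sketch below assumes the usual definitions for a directed network: density is the fraction of possible ordered ties that are present, distance-based cohesion is the mean reciprocal shortest-path distance over all ordered pairs (unreachable pairs contribute 0), and distance-weighted fragmentation is one minus that cohesion. The last assumption agrees with Table 4, where 0.344 + 0.656 = 1. All names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def density_and_cohesion(adj):
    """Return (density, cohesion, fragmentation) for a directed graph."""
    n = len(adj)
    pairs = n * (n - 1)  # number of ordered node pairs
    edges = sum(len(set(nbrs)) for nbrs in adj.values())
    reciprocal_sum = 0.0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                reciprocal_sum += 1.0 / d
    cohesion = reciprocal_sum / pairs
    return edges / pairs, cohesion, 1.0 - cohesion


# Hypothetical directed chain a -> b -> c.
chain = {"a": ["b"], "b": ["c"], "c": []}
density, cohesion, fragmentation = density_and_cohesion(chain)
```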


3.2.2 Clustering Coefficient 

The clustering coefficient is mainly used to reflect the characteristics of the 
topological structure of the propagation network and the degree of aggregation of the 
relationships among the nodes [6]. If a node j in the network has $n_j$ neighbor nodes, 
then there may be at most $n_j(n_j - 1)/2$ edges between these $n_j$ nodes. We define 
the ratio of the actual number of edges $E(j)$ between the $n_j$ neighbor nodes to the 
possible number of edges $n_j(n_j - 1)/2$ as the clustering coefficient $CC(j)$ of node 
j, that is: 


$$CC(j) = \frac{2E(j)}{n_j(n_j - 1)} \qquad (2)$$


The clustering coefficient CC of the whole network is the average of the clustering 
coefficients of all N nodes j: 


$$CC = \frac{1}{N}\sum_{j} CC(j) \qquad (3)$$
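
Eqs. (2) and (3) can be checked on a small example with the following minimal sketch. It assumes an undirected graph stored as a dictionary of neighbor sets and, as a convention that the text does not specify, gives a coefficient of 0 to nodes with fewer than two neighbors. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering(adj):
    """Return (CC(j) for every node j, network average CC).

    adj maps each node to the set of its neighbours (undirected graph).
    """
    cc = {}
    for j, nbrs in adj.items():
        n_j = len(nbrs)
        if n_j < 2:
            cc[j] = 0.0  # convention: no pair of neighbours exists
            continue
        # E(j): number of edges that actually exist among the neighbours of j.
        e_j = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        cc[j] = 2.0 * e_j / (n_j * (n_j - 1))  # Eq. (2)
    return cc, sum(cc.values()) / len(cc)  # Eq. (3)


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
per_node, average = clustering(toy)
```

Here nodes 1 and 2 have coefficient 1, node 3 has 1/3 (one edge among three neighbor pairs), node 4 has 0, and the network average is 7/12.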


The value of the clustering coefficient ranges from 0 to 1. The greater the clustering 
coefficient, the stronger the cohesive force of the whole propagation network and the 
closer the links between nodes. The measurement results in Table 5 show that the 
overall clustering coefficient of the propagation network of the public opinion event 
“A girl suffered attacks in a Yitel” is 0.100. This clustering coefficient is very 
small, which means that the level of information communication between nodes is 
relatively low. At the same time, the network structure is relatively loose, the 
community structure and the internal substructures are not obvious, and the connections 
between nodes are relatively weak. 


Table 5. Analysis of clustering coefficient of propagation network of “A girl suffered attacks in 
a Yitel” (part) 

Node                    | Clustering coefficient | Number of node pairs 
Wanwan_2016             |                        | 19503 
Ctrip                   |                        | 4851 
Safety Beijing          |                        | 3655 
Yitel                   |                        | 2850 
Homeinns Co., Ltd       |                        | 3240 
Reporter_hong-taoXue    | 0.138                  | 351 
People outside the city |                        | 91 

Weighted overall graph clustering coefficient: 0.100. 

In this paper, we use three indexes, namely the network density, the distance between 
nodes and the clustering coefficient [7-9], to analyze the overall network structure of 
the micro-blog public opinion event “A girl suffered attacks in a Yitel”. According to 
the measurement results, the network density of the propagation network is very small, 
only 0.0102; the links between nodes are dispersed, and the exchange of information is 
infrequent. The propagation ability of sub-public communication is moderate; however, 
nearly half of the relationship links have distance 4, which shows that mass 
communication has a huge effect in this event. In spite of this, the cohesive force of 
the propagation network is weak, and communication between nodes is lacking: forwarding 
relationships do exist, but after forwarding, the nodes did not communicate effectively 
with other nodes, so the propagation of public opinion ceased to advance. For this 
micro-blog public opinion event, information is concentrated on the party and the 
company involved, “Wanwan_2016” and “Yitel”, which are the key communicators of this 
event. “Safety Beijing”, “Reporter_hong-taoXue” and the other key nodes have only 
one-way forwarding relationships rather than mutual forwarding relationships, and the 
connectivity between nodes is poor, so it is difficult to form a small group. 


4 Conclusion 


In this paper, we measured the key nodes and the overall structure of the propagation 
network by collecting and arranging statistics on the relationships between the nodes 
that propagate micro-blog public opinion. The analysis of the measurement results shows 
that it is difficult to form a small group because of the lack of cohesion between the 
nodes of this event. Looking at the underlying reason, the event “A girl suffered 
attacks in a Yitel” is a vicious incident involving an individual party, and the 
micro-blog public opinion focused on the hotel’s indifference to violence, about which 
the party complained. The companies involved also issued statements saying that they 
would investigate the matter thoroughly in order to give the party and the public an 
account. Although the number of forwards of this incident was very high in the short 
term and “bridge” nodes acting as “opinion leaders” appeared, there is a lack of 
cohesion between the nodes of the propagation network, there is no small group, and the 
communication between the nodes is weak. 

Therefore, the parties of public opinion events should actively enhance communication 
with other users, and more media, celebrities and government micro-blogs need to pay 
attention to public opinion events so as to promote the transfer of public opinion 
information through their own influence. On the one hand, the parties should strengthen 
communication with other users and, after forwarding their own micro-blogs, select 
representative users to forward each other’s posts. This will not only form a small 
group based on forwarding but also enhance the cohesion between the user nodes, so that 
public opinion information can spread rapidly. On the other hand, opinion leaders can 
be developed to guide public opinion. An “opinion leader” in Sina micro-blog is a 
communicator who has appeal and influence in the propagation process of network public 
opinion. The statements of such a communicator easily win the recognition of other 
users, who use the comment and forwarding functions to spread them in the micro-blog, 
resulting in a greater impact on public opinion. Therefore, opinion leaders are needed 
to exert their influence and spread positive energy, so as to influence other user 
groups. 


Abstract. A scholar recommendation model based on community division is 
established according to the social characteristics of academic social networks. 
The model is developed with GraphChi, the single-machine version of the 
large-scale graph computing system launched by GraphLab, to find the core 
networks of the network topology graph in parallel. In the established network, 
self-adaptive label propagation is used to create labels, and the final community 
division is obtained according to the labels on the nodes. Recommendation is then 
computed within each community to provide expert recommendation services. 
Experiments on a data set from the academic social networking platform SCHOLAT 
suggest that the model not only creates communities quickly but also achieves good 
recommendation results with all three personalized methods, i.e. Community Weight 
Recommendation (CWR), Community Random Recommendation (CRR) and 
Acquaintance Community Recommendation (ACR). 


Keywords: GraphChi · Community detection · Kernel sub-network · 
Label propagation · Scholars recommendation 


1 Introduction 


In recent years, with the advent of WeChat, Weibo, Facebook and other social 
networking platforms, large-scale social networks have developed rapidly. The greatest 
charm of a social network is its social nature, and each social networking platform 
creates many explicit or hidden circles of friends [1]. With the rapid development of 
intelligent terminals, the network data on social networking platforms has grown 
rapidly, so people often encounter information overload when they look for friends who 
interest them. Friend recommendation on a social networking platform can therefore help 
users find potential friends more quickly and more accurately. As a kind of social 
platform, an academic social networking platform also needs a scholar recommendation 
system to help users select scholars who interest them. 

At present, scholars at home and abroad have proposed many friend recommendation 
algorithms to solve the problem mentioned above, which can be divided into two 
categories. The first category is based on the user’s existing information. In [2], a 
friend recommendation model is established from the user’s preferences and the 
information that already exists on the social network platform; the model explores and 
recommends friends who may have something in common with the user on the basis of that 
information. The algorithm in [3] analyzes the microblog content published by the user 
to discover the user’s interests and then recommends friends in whom the user may be 
interested. The authors of [4] observed that some friend recommendation algorithms 
ignore the relationships between individuals and timing factors, and proposed a method 
that models the temporal behavior among users based on the user’s existing information; 
this yields the set of neighbors with the greatest influence on the current user, which 
is then merged into a collaborative filtering recommendation algorithm based on 
probabilistic matrix factorization. In [5], non-topological information is used to 
calculate the similarity among users, and potential friends who are similar to the 
target user are recommended according to this similarity. The second category is based 
on social network topology modeling. In [6], the degree of separation of each node is 
calculated by dividing the topological graph of the friend relationships in the social 
network. Based on the degree of separation, the algorithm places nodes with the same 
degree of dissimilarity into the same community group and then recommends nodes within 
the same group to one another. 

Considering the unique social nature of academic social networks [7-9], we propose a 
scholar recommendation algorithm based on label-based community discovery, which is a 
friend recommendation algorithm based on social network topology modeling. The Label 
Propagation Algorithm (LPA) [10], first proposed by Raghavan et al. in 2007, is a 
community discovery algorithm based on label propagation. LPA is relatively simple. The 
algorithm first assigns a unique label to every node, then repeatedly updates the 
labels of all nodes until the convergence requirement is met, and finally obtains 
non-overlapping communities. In recent years, many scholars [11-13] have improved LPA 
from different angles because of its simplicity and high efficiency. Because LPA can 
only discover non-overlapping communities, Gregory proposed the Community Overlap 
Propagation Algorithm (COPRA) [14] in 2009, which is applicable to overlapping 
communities. To address the problems that community detection algorithms based on 
label propagation in complex networks depend on a preset parameter in real networks and 
produce redundant labels, LI Chunyin proposed the Adaptive Label Propagation Algorithm 
(ALPA) [15]. The algorithm uses an adaptive threshold to eliminate unreasonable labels 
in the iterative process and finally classifies the nodes with the same label into one 
community. ALPA has achieved good results, especially in academic social networks. 
However, it is not ideal for social networks with a large user population and complex 
user relationships, because its time complexity is relatively high in this case. To 
solve this problem, this paper proposes the Community Detection Based on GraphChi 
(CDBG) algorithm, a scholar recommendation algorithm based on community discovery and 
the GraphChi system. CDBG uses the characteristics of the GraphChi system to implement 
ALPA in parallel and then uses the community clustering results to make friend 
recommendations. This improves the response speed of friend recommendation. 


2 Recommended Approach 


The main idea of the model is to discover communities from the topology graph of the 
user relationships in the social network and then to recommend friends within each 
community. Finding the core networks and then building the communities by label 
propagation is an approach that performs well and is easy to implement. However, 
finding the core networks in massive graph data is very time-consuming, which prevents 
the recommendation system from responding within the time users are prepared to wait. 
It is therefore very important to improve the search strategy for the core networks. 
The strategy of this paper is to use the concurrency mechanism of the GraphChi system 
to split the topology graph of the friend relationships and then to find the core 
networks in each local topology. As the number of nodes in each part is much smaller 
than in the whole graph, this speeds up the search for the core networks and thus 
improves the response time of the whole recommendation system. 


2.1 GraphChi 


The GraphChi system is a stand-alone (single-machine) version of the large-scale graph 
computing system introduced by GraphLab. Although it is not based on the popular 
distributed architecture, the system is very efficient, and it is not inferior to 
distributed computing systems [16] in dealing with large-scale data. Its ingenious 
design allows GraphChi to process large-scale graph data efficiently. To speed up 
computation, GraphChi loads the data into memory, as other large-scale computing 
systems do. Since memory is very limited in a stand-alone environment, it is impossible 
to load all the data into memory when the graph is large. GraphChi therefore first cuts 
the graph into a number of small subgraphs and then loads one subgraph at a time into 
memory for calculation. Loading each subgraph one after the other completes one 
iteration, and after several rounds of iterations the graph computation task is 
complete. This process is the main idea of Parallel Sliding Windows (PSW). 

Each run of PSW updates the data of one subgraph. The update process includes three 
main steps: loading the subgraph data, executing the node update function concurrently, 
and writing the updated subgraph data back. When GraphChi has completed these three 
sequential steps, the update of one subgraph is complete. When all P intervals have 
been updated, one round of iteration over the entire graph is complete. 
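
The load, update and write-back cycle can be imitated in a few lines of Python. The sketch below is a toy, in-memory illustration of processing a graph interval by interval; it is not the GraphChi API, it keeps everything in memory, and it runs the vertex updates sequentially rather than in parallel. The names `run_intervals` and `min_label` are hypothetical, and the update function shown (each vertex adopts the smallest value among itself and its neighbors) is chosen only because its result is easy to verify.

```python
def run_intervals(vertices, edges, values, update, num_intervals, iterations):
    """Process the graph one vertex interval at a time, for several rounds."""
    ordered = sorted(vertices)
    size = -(-len(ordered) // num_intervals)  # ceiling division
    intervals = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    for _ in range(iterations):
        for interval in intervals:
            members = set(interval)
            # Step 1: "load" only the edges that touch this interval.
            sub_edges = [(a, b) for a, b in edges if a in members or b in members]
            # Step 2: run the update function on every vertex of the interval.
            for v in interval:
                nbrs = [a if b == v else b for a, b in sub_edges if v in (a, b)]
                values[v] = update(v, values[v], [values[n] for n in nbrs])
            # Step 3: a real system would write the updated data back to disk here.
    return values


def min_label(vertex, own_value, neighbour_values):
    """Example vertex update: keep the smallest value seen so far."""
    return min([own_value] + neighbour_values)


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
result = run_intervals(vertices, edges, {v: v for v in vertices},
                       min_label, num_intervals=2, iterations=2)
```

After the iterations, every vertex carries the smallest identifier of its connected component, which shows how a single per-vertex function plus repeated interval sweeps yields a global result.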

As can be seen from the above flow, the key to GraphChi’s handling of massive amounts 
of data in a single-machine environment is that the data is sliced and only one slice 
is held in memory at a time. Because only one slice needs to be read at a time, as long 
as the external storage is large enough, the value of P can be adjusted so that memory 
meets the computing requirements. GraphChi uses a vertex-centric programming model in 
which adjacent vertices pass messages along edges. Developers only need to write the 
update function for a single vertex, and the GraphChi framework takes care of specific 
details such as executing vertices in parallel. 


2.2 Community Detection Based on GraphChi (CDBG) 


2.2.1 Establish Core Network 

Definition 1: Complete graph. The topological graph of social network user relations 
is represented by the graph G = {V, E}, where V is the set of user nodes in the 
topology graph and E is its edge set, which represents the friend relationships among 
users. If a graph $G_1$ is a subgraph of G and every pair of distinct vertices in $G_1$ 
is connected by exactly one edge, then $G_1$ is called a complete graph. 


Definition 2: Core network. In the graph G = {V, E}, assume that A and B are any two 
nodes in V that are not yet marked. If A and B are each other’s unmarked neighbor of 
maximum degree, then the edge from A to B is used as the initial edge to find a 
complete graph. If $G_1 \subseteq G$ is a complete graph found in this way and there is 
no complete graph $G_2 \subseteq G$ such that $G_1 \subset G_2$, then $G_1$ is a core 
network of G. 


Definition 3: Community consolidation. Let $G_i$ and $G_j$ be two sets representing two 
different communities. If $G_i$ is a subset of $G_j$, that is, the community $G_j$ 
contains the community $G_i$, then deleting the community $G_i$ is referred to as 
merging the communities $G_i$ and $G_j$. 

The core network is the core unit of a community, and the nodes in the same core 
network must be in the same community. Because the algorithm builds communities from 
the core networks, it must first look for the core networks. This is the 
initialization phase of the algorithm. The specific steps are as follows. 


1. Assign 0 to all nodes in V; 

2. Select a node $U_i$ with value 0 in the data fragment, then find the node $U_j$ with 
the largest degree and value 0 among the adjacent nodes of $U_i$; 

3. If the value of node $U_j$ is 0 and $U_i$ is the adjacent node of $U_j$ with the 
largest degree, then choose the edge e($U_i$, $U_j$) as the initial edge to find the 
core network according to Definition 2; 

4. If the condition of step 3 is satisfied, then the number of whichever of $U_i$ and 
$U_j$ has the larger degree is taken as the core network number and assigned to all the 
nodes in the core network found in step 3; 

5. If the condition of step 3 is not satisfied, go back to step 2 and select another 
node; 

6. Repeat steps 2-5 until the core networks no longer change; the process of finding 
the core networks then stops. 
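
The steps above can be sketched in Python as follows. This is a single-machine illustration under several assumptions that the text leaves open: node identifiers are integers, ties in degree are broken in favor of the smaller identifier, a node counts as the largest-degree neighbor if its degree equals the maximum among the unmarked neighbors, and the complete graph is grown greedily from the initial edge. The toy graph is hypothetical (it is not the graph of Fig. 1) and was chosen so that the result has the same shape as the example in the text.

```python
def find_core_networks(adj):
    """Return node -> core network number (0 means no core network).

    adj maps each integer node ID to the set of its neighbours (undirected).
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    label = {v: 0 for v in adj}  # step 1: assign 0 to all nodes

    def best_unmarked(v):
        free = [u for u in adj[v] if label[u] == 0]
        # Assumption: ties in degree go to the smaller node ID.
        return max(free, key=lambda u: (deg[u], -u)) if free else None

    changed = True
    while changed:  # step 6: repeat until nothing changes
        changed = False
        for ui in sorted(adj):  # step 2: pick an unmarked node
            if label[ui] != 0:
                continue
            uj = best_unmarked(ui)
            if uj is None:
                continue
            # Step 3: ui must be a maximum-degree unmarked neighbour of uj.
            if deg[best_unmarked(uj)] != deg[ui]:
                continue  # step 5: try another node
            # Grow a complete graph greedily from the initial edge (ui, uj).
            clique = {ui, uj}
            for c in sorted(adj[ui] & adj[uj], key=lambda u: (-deg[u], u)):
                if label[c] == 0 and clique <= adj[c]:
                    clique.add(c)
            # Step 4: number the core network after the higher-degree endpoint.
            core_id = max((ui, uj), key=lambda u: deg[u])
            for v in clique:
                label[v] = core_id
            changed = True
    return label


# Hypothetical graph: clique {2, 3, 4, 5}, triangle {6, 7, 8}, extra edges 1-4 and 4-7.
toy = {1: {4}, 2: {3, 4, 5}, 3: {2, 4, 5}, 4: {1, 2, 3, 5, 7},
       5: {2, 3, 4}, 6: {7, 8}, 7: {4, 6, 8}, 8: {6, 7}}
cores = find_core_networks(toy)
```

On this graph the sketch yields two core networks, numbered 4 and 7, and leaves node 1 unmarked.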




Following the rules above, the network topology graph shown in Fig. 1 is taken as an 
example to illustrate the process of establishing the core networks. A total of two 
core networks are found in the topology graph, namely the core networks numbered 4 and 
7: $M_4$ = (3, 4, 2, 5) and $M_7$ = (6, 7, 8). Finally, the core network number is 
assigned to all the nodes of each core network, that is, 4 is assigned to nodes 3, 4, 2 
and 5, and 7 is assigned to nodes 6, 7 and 8. In this algorithm, the value of a node is 
the label of the node. Figure 1 shows the topology after the core networks have been 
established. 


Fig. 1. After the establishment of the core network 


2.2.2 Label Update 

When the core networks have been established, each node in a core network has been 
assigned a value. The value on the node is its label, the original label, and it is 
also the number of the core network to which the node belongs. The core network is the 
influential circle of friends in the community and the core of the community’s 
topological relationships. Therefore, the algorithm spreads labels outward from the 
core networks. To reduce redundant labels, avoid the excessive spread of labels and 
improve the stability of the algorithm, the algorithm automatically removes labels with 
small weights during label propagation, so that labels are propagated self-adaptively. 
The specific rules are as follows. 


1. The weight of the original label of each node in a core network is set to 1; 
2. In each iteration, the labels of every node v in V are updated with the following 
rules: 

(a) Node v checks the labels of its neighbors one by one. If v and its adjacent node 
$v_i$ both have the label $L_i$, the weight of $L_i$ in v is updated to 
$W_v + W_i/d$, where $W_v$ is the weight of the label $L_i$ in v, $W_i$ is the weight 
of the label $L_i$ in the adjacent node $v_i$, and d is the degree of v. If the 
adjacent node $v_i$ has the label $L_i$ but v does not, then v receives the label 
$L_i$ with weight $W_i/d$; 

(b) Let C be the number of labels on v. If C is greater than 1, every label with 
weight less than 1/C is deleted. If all the labels satisfy this condition, the label 
with the largest weight is retained and the remaining labels are deleted. If more than 
one label has the largest weight, one of them is kept at random; 

(c) Normalize the labels in node v so that the sum of the weights in each node equals 
1. 
3. Repeat step 2 until all nodes have at least one label; 
4. Traverse all nodes and classify nodes with the same label into the same community. 
If a node has more than one label, it is assigned to multiple communities; 
5. Consolidate the communities according to Definition 3. 
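
The rules above can be sketched as follows. The code is an illustration, not the authors' implementation: it assumes that all nodes are updated synchronously in each round, that the 1/C threshold of rule (b) is applied to the weights before normalization, and that the loop is capped at a hypothetical `max_rounds`. The toy graph and its initial core labels are also hypothetical.

```python
import random


def update_node(v, adj, labels, rng=random):
    """Apply rules (a)-(c) to node v; labels maps node -> {label: weight}."""
    d = len(adj[v])
    new = dict(labels[v])
    for u in adj[v]:
        for lab, w in labels[u].items():
            # (a) add the neighbour's weight divided by the degree of v.
            new[lab] = new.get(lab, 0.0) + w / d
    if len(new) > 1:
        # (b) drop labels lighter than 1/C; if none survive, keep one heaviest.
        threshold = 1.0 / len(new)
        kept = {lab: w for lab, w in new.items() if w >= threshold}
        if not kept:
            top = max(new.values())
            winner = rng.choice(sorted(l for l, w in new.items() if w == top))
            kept = {winner: top}
        new = kept
    # (c) normalise so that the weights on v sum to 1.
    total = sum(new.values())
    return {lab: w / total for lab, w in new.items()} if total else {}


def detect_communities(adj, core_label, max_rounds=100):
    """Propagate core labels, group nodes by label, then merge subsets."""
    labels = {v: ({core_label[v]: 1.0} if core_label[v] else {}) for v in adj}
    for _ in range(max_rounds):
        if all(labels.values()):  # every node has at least one label
            break
        labels = {v: update_node(v, adj, labels) for v in adj}
    groups = {}
    for v, labs in labels.items():
        for lab in labs:
            groups.setdefault(lab, set()).add(v)
    sets = list(groups.values())
    # Definition 3: delete any community contained in another one.
    return [c for c in sets if not any(c < other for other in sets)]


# Hypothetical graph and core network numbers (0 = not in a core network).
toy = {1: {4}, 2: {3, 4, 5}, 3: {2, 4, 5}, 4: {1, 2, 3, 5, 7},
       5: {2, 3, 4}, 6: {7, 8}, 7: {4, 6, 8}, 8: {6, 7}}
core = {1: 0, 2: 4, 3: 4, 4: 4, 5: 4, 6: 7, 7: 7, 8: 7}
communities = sorted(sorted(c) for c in detect_communities(toy, core))
```

On this graph node 1 adopts label 4 from its only neighbor, while the weak cross-label weights on nodes 4 and 7 fall below the 1/C threshold and are removed, so two communities remain.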


According to the above rules, we now take Fig. 1 as an example to update the labels. 
Two communities are eventually generated, namely $C_1$ = {1, 2, 3, 4, 5} and 
$C_2$ = {6, 7, 8}. The results of community detection are shown in Fig. 2. 





Fig. 2. Results of community detection 


2.3 Scholars Recommend 


In a social network, users in the same community tend to have similar interests or 
areas of work and are more likely to accept one another. Therefore, the 
recommendations in this paper are made within the divided communities. The CDBG 
algorithm is used to divide the network into communities. After the community division, 
the model uses three methods to make personalized recommendations, namely Community 
Weight Recommendation (CWR), Acquaintance Community Recommendation (ACR) and Community 
Random Recommendation (CRR). 

After the scholar communities have been established with the CDBG algorithm, each 
scholar may belong to several different communities. In the CWR method, all the 
scholars who are in the same community as the target scholar A and are not yet friends 
of A are sorted by weight from high to low, and the N scholars with the highest weights 
are recommended to A. This method is suitable for recommending scholars to users with a 
Ph.D. and to professors, because we assume that highly educated user groups are more 
likely to accept influential scholars in the same community. The ACR method selects the 
community with the fewest nodes among the communities to which the target scholar A 
belongs, ranks the users of that community in descending order of weight, and 
recommends the N scholars with the highest weights to A. This method is suitable for 
users with a master’s degree or below, as these users have limited communication and 
academic skills and are more willing to accept scholars in their acquaintance 
communities. For example, a college counselor is more willing to accept the college 
clerk and other college counselors in her acquaintance community than a well-known 
scholar in a field. The CRR method is the simplest. It randomly recommends nodes that 
are in the same community as the target scholar A and whose weights are higher than a 
specified value. This approach is suitable for users who have just joined the academic 
social networking platform and for whom friend and team information is scarce; it is a 
cold-start recommendation method. Taking the CWR method as an example, the process of 
scholar recommendation is illustrated below: 


1. Input 
   G = {V, E}: G represents the initial community topology, 
   A: A represents the ID of the target scholar, 
   N: N represents the number of recommended scholars. 
2. Recommended procedure 
   (1) Community Detection (CDBG) 
   foreach iteration do 
       shards[] <- InitializeShards(P) 
       for interval <- 1 to P do 
           subgraph <- LoadSubgraph(interval) 
           parallel foreach vertex in subgraph.vertex do 
               CDBG_updateVertex(vertex) 
           end 
           shards[interval].UpdateFully() 
           for s in 1,...,P, s != interval do 
               shards[s].UpdateLastWindowToDisk() 
           end 
       end 
   end 
   (2) Scholars Recommend 
   foreach iteration do 
       shards[] <- InitializeShards(P) 
       for interval <- 1 to P do 
           max_n(vertex.edges) 
       end 
   end 
3. Output 
   1, 2, 3, ..., N: the top N users with the highest weight 
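
Once the communities are known, the three recommendation methods reduce to simple set operations. The sketch below is a plain-Python illustration, not the GraphChi-based procedure above. It assumes that communities are given as a list of sets, that `friends` maps each user to the set of existing friends, and that `weight` maps each user to a numeric weight whose definition the text does not specify. Excluding the target and existing friends in ACR and CRR is also an assumption. All names and data are hypothetical.

```python
import random


def cwr(target, communities, friends, weight, n):
    """Community Weight Recommendation: top-n by weight across all of the
    target's communities, excluding the target and existing friends."""
    pool = set().union(*(c for c in communities if target in c))
    pool -= friends[target] | {target}
    return sorted(pool, key=lambda u: (-weight[u], u))[:n]


def acr(target, communities, friends, weight, n):
    """Acquaintance Community Recommendation: top-n by weight within the
    smallest community that contains the target."""
    own = [c for c in communities if target in c]
    if not own:
        return []
    pool = min(own, key=len) - friends[target] - {target}
    return sorted(pool, key=lambda u: (-weight[u], u))[:n]


def crr(target, communities, friends, weight, n, min_weight, rng=random):
    """Community Random Recommendation: random picks among community members
    whose weight exceeds min_weight."""
    pool = set().union(*(c for c in communities if target in c))
    pool -= friends[target] | {target}
    eligible = sorted(u for u in pool if weight[u] > min_weight)
    return rng.sample(eligible, min(n, len(eligible)))


# Hypothetical data: user 1 belongs to two communities and is friends with 2.
communities = [{1, 2, 3, 4, 5}, {1, 6}]
friends = {1: {2}}
weight = {1: 0.9, 2: 0.8, 3: 0.5, 4: 0.7, 5: 0.1, 6: 0.6}
top_cwr = cwr(1, communities, friends, weight, 2)
top_acr = acr(1, communities, friends, weight, 2)
picked_crr = crr(1, communities, friends, weight, 2, min_weight=0.55)
```

With these inputs CWR returns users 4 and 6, ACR returns only user 6 (the sole other member of the smallest community), and CRR returns users 4 and 6 in random order.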
3 Experimental Results 


In this part, we first introduce the data set used in the experiment, and then explain the 
evaluation method of the experiment. Finally, we give the experimental results of CWR, 
CRR and ACR, and analyze the experimental results. 


3.1 Description of Dataset 


The data set used in the experiment is the friend relationship data set of the academic 
social network platform (SCHOLAT) on October 12, 2016. The data set records the 
relationship between scholars and friends on the social network platform. After 
denoising the data set, there are 5168 users whose information is disclosed and 22284 
user relations. 


3.2 Evaluation Method 


In the experiment, we used four general indicators [17] to evaluate the recommendation 
results, namely Precision, Recall, F-measure and MAP. These four indicators are 
standard measures of the accuracy of a recommendation system for the specified users. 
Precision is the accuracy of the system’s recommendations, and it is defined as 
follows: 


$$P = \frac{1}{T}\sum_{i=1}^{T}\frac{N_i}{L} \qquad (1)$$


where T is the number of experimental test samples, $N_i$ is the number of recommended 
objects that the user likes in the i-th recommendation, and L is the length of the 
recommendation list. 

Recall expresses the possibility that an object the user likes is recommended by the 
recommendation system, and it is defined as follows: 


$$R = \frac{\sum_{i=1}^{T} N_i}{\sum_{i=1}^{T} A_i} \qquad (2)$$


where $A_i$ represents the total number of objects in the test set that are accepted by 
the user in the i-th recommendation. 

F-measure is the weighted harmonic mean of Precision and Recall; the higher its value, 
the more effective the test result. It is defined as: 


$$F_\alpha = \frac{(\alpha^2 + 1)\,P R}{\alpha^2 (P + R)} \qquad (3)$$


When α = 1, the formula becomes the most common $F_1$-measure, which is defined as: 


$$F_1 = \frac{2 P R}{P + R} \qquad (4)$$


The final evaluation criterion is MAP, which reflects the average accuracy of the
recommender system and is the probability that a recommended object is accepted.
It is defined as:

$$\mathrm{MAP} = \frac{1}{T}\sum_{j=1}^{T}\left(\frac{1}{N_j}\sum_{k=1}^{L} p(L_{jk})\right) \tag{5}$$

where $p(L_{jk})$ is the ratio of the number of objects the user likes among the first
$k$ recommended objects in the $j$th recommendation to $k$, and $N_j$ is the number of
objects that the user likes in the $j$th recommendation. $L$ is the length of the
recommendation list; we set L = 10 and L = 5.
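As a minimal sketch of how Precision, Recall and the F1-measure defined above could be computed, the following Python fragment works on per-sample counts. The function names and the toy data are hypothetical and are not taken from the experiment; only the alpha = 1 case of the F-measure is implemented.

```python
from typing import Sequence


def precision(hits: Sequence[int], list_length: int) -> float:
    """Mean over test samples of (liked recommendations / list length)."""
    return sum(n / list_length for n in hits) / len(hits)


def recall(hits: Sequence[int], accepted: Sequence[int]) -> float:
    """Total liked recommendations divided by total accepted objects."""
    return sum(hits) / sum(accepted)


def f1_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the alpha = 1 case)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical toy data: two test users, list length L = 5.
hits = [3, 4]        # N_i: liked objects in each recommendation
accepted = [5, 5]    # A_i: objects each user accepts in the test set
p = precision(hits, 5)
r = recall(hits, accepted)
f1 = f1_measure(p, r)
```

Keeping the counts per sample makes it easy to recompute the metrics for a different list length without touching the aggregation code.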


3.3 Result Analysis 


On the data set described above, the community recommendation model proposed in this
paper generates a total of 442 communities; the largest has 1986 user nodes and the
smallest has 3 nodes. Generating the communities takes 1.508 s in total. To verify the
validity of the model and compare the strengths and weaknesses of the CWR, CRR and ACR
recommendations, we compute Precision, Recall, F-measure and MAP from a questionnaire
survey. The questionnaires fall into two categories, one with a recommendation length of
10 and the other with a length of 5; Table 1 gives the experimental results for L = 10
and L = 5. From Table 1, we can see that whichever recommendation method is used, the
recommendation accuracy is above 55%, and the average accuracy (MAP) also stays above
55%. When the ACR method is adopted, the accuracy reaches 77.68% and the MAP reaches
81.63% for L = 10; for L = 5, the accuracy is 84% and the MAP is 85.84%. These results
demonstrate the effectiveness of the model proposed in this paper.


Table 1. Comparison of experimental results.


L = 10
Description |        | MAP
CWR         | 0.5141 | 0.5634
CRR         | 0.5585 | 0.6228
ACR         | 0.6089 | 0.8163

L = 5
Description |        | MAP
CWR         | 0.4935 | 0.5709
CRR         | 0.5445 | 0.6359
ACR         | 0.6360 | 0.8584
By comparing the results in Table 1, it can be seen that the CWR recommendation
method performs worst among the three methods. There are two reasons for this. First,
on the SCHOLAT platform, a scholar's weight is not the same as a scholar's influence:
some scholars often use the platform for teaching and thus accumulate a large number of
student friends, which increases their weight but does not affect the choices of other
people. Second, being highly qualified does not guarantee that a scholar is well known
to other scholars, nor that his or her research direction interests the users who
receive the recommendation. Moreover, the majority of users of the platform are master's
students, so their academic social circles and research areas are relatively limited.
The CWR method is therefore more suitable for recommending scholars to doctoral
students, professors and other highly educated users. As can be clearly seen in
Table 1, the ACR method is the most effective. The reason is that users in an
acquaintance community are very likely to be connected already, so a user is more
inclined to accept a scholar from that community. In addition, comparing the results
for L = 10 and L = 5 shows no significant difference in the effect of CRR between the
two lengths, whereas CWR performs better at L = 5 and ACR gives better results at
L = 10. This is consistent with the above analysis.


Fig. 3. Results of community detection


Figure 3 shows how the recommendation results change with the length of the
recommendation list. As can be seen from the figure, Recall and F1-measure are more
sensitive to the list length, while Precision and MAP are less affected by it. The
figure also shows that the ACR method achieves better results at the different list
lengths, so ACR gives the best results in many cases. But the ACR method clearly has its
shortcomings. Acquaintance community recommendation improves the user's acceptance of
the recommended objects, but such recommendations lack novelty and unexpectedness, and
their coverage is low, so user satisfaction cannot be guaranteed. CWR also has these
deficiencies. CRR is not the most accurate, but its randomness is high, so it provides
enough freshness and serendipity and has higher coverage. In summary, if the user is a
doctoral student or a professor, the CRR and CWR modes can be chosen. To recommend
scholars to master's students and user groups below that level, the best choice is ACR.


4 Conclusion and Discussion 


In this paper, we propose a community-based recommendation model for large-scale
social networks. The model first uses the GraphChi framework to divide the network into
communities and obtain the core network of the network structure. Then, with the core
network as the core, labels are propagated and the nodes are updated using the
synchronous adaptive label propagation mode, eventually forming multiple communities in
which scholars are recommended. For the stage of recommending friends, this paper
provides three methods: community weight recommendation (CWR), community random
recommendation (CRR) and acquaintance community recommendation (ACR). The analysis of
the experiment shows that the overall effect of CRR is the best, and that of ACR is the
second.

What the CRR approach lacks in the scholar recommendation stage is precisely the
strength of ACR. In future work, we intend to study how to combine the two
recommendation methods to further improve the effectiveness of recommendations.
Abstract. To make Unmanned Aerial Vehicle (UAV) simulation more realistic, this paper
analyzes in detail the factors that influence UAV fuel consumption. On the basis of
these factors, a fuel consumption model of the UAV is presented, which can be used to
implement fuel consumption simulation in a UAV simulation control system.


Keywords: Unmanned Aerial Vehicle · Fuel consumption · Simulation model ·
Energy balance


1 Introduction 


Technologies for Unmanned Aerial Vehicles (UAVs) have developed significantly in recent
years, and UAV simulation is of great theoretical and practical importance [1]. The
instrument monitoring area displays the flight status of the UAV during missions in real
time; it contains a series of data, e.g. fuel consumption, propeller speed, propeller
temperature and link state. These data are an important reference for judging whether
the UAV is in a normal flight state. As an important part of the instrument monitoring
area, the fuel gauge displays the fuel consumption level of the UAV in real time, which
is extremely valuable for flight status monitoring and flight mission planning [2].
Section 2 analyzes the factors influencing fuel consumption in detail, and Sect. 3
presents a fuel consumption model of the UAV based on them. The presented model can be
used to implement fuel consumption simulation for the instrument monitoring area of a
UAV simulation control system.


2 Influencing Factors Analysis of Fuel Consumption of UAV 


2.1 Fundamental Aerodynamic Parameters 


Aerodynamic parameters include the lift coefficient, drag coefficient, lift-drag ratio,
etc. The fundamental aerodynamic parameters of UAVs of different types or performance
vary with temperature and speed. However, the correlations among these parameters are
complex and can hardly be described by functions; they are therefore measured by wind
tunnel tests or test flights, as shown in Figs. 1 and 2.


X: Aircraft elevation, Y: Lift coefficient


Fig. 1. Relation curve of aircraft elevation and lift coefficient


X: Mach number, Y: Drag coefficient


Fig. 2. Scatter plot of Mach number and drag coefficient


2.2 Engine Performance Parameters 


Engine performance is linearly related to fuel consumption; the fuel consumption of the
engine is determined by the UAV type, speed and altitude. One of the most important
indexes is the thrust specific fuel consumption (TSFC) [3]. Unlike the fundamental
aerodynamic parameters, the TSFC of the engine is a function of speed, altitude and
thrust.


2.3 Flight Trajectory 


The flight trajectory of the UAV is shown on the head-up display (HUD) [4], which
includes speed, altitude, pitch angle, roll angle, deflection angle, acceleration,
course, etc. These indexes are all factors influencing fuel consumption.


2.3.1 Speed 

The speed to be considered here is the level flight speed of UAV. There are two speed 
units in aviation field: the commonly used km/h, and the Mach number. Speed relates 
directly to fuel consumption of UAV. 


2.3.2 Pitch Angle 

The fuselage of a UAV is at a certain angle to the horizontal while the UAV is
ascending or descending. The UAV noses up during ascent, and the angle, designated
positive, is called the elevation angle; conversely, the UAV noses down during descent,
and the angle, designated negative, is called the depression angle. When ascending,
greater thrust is required to overcome gravity and generate upward acceleration, so
fuel consumption increases; when descending, the thrust decreases and fuel consumption
decreases accordingly [5, 6].


2.3.3 Roll Angle and Deflection Angle 

When making a turn, the UAV is subject to centrifugal force, gravity and drag, and tilts
at an angle. The tilt angle is defined as the roll angle. The turning radius and turning
rate of an aircraft are determined by the roll angle. The deflection angle is the course
change of the nose of the UAV when turning. More thrust is needed to maintain circling
when the UAV makes a turn. Figure 3 shows a series of roll angles and deflection angles
sampled in a turning process.


X: Deflection angle, Y: Roll angle


Fig. 3. Scatter plot of roll angle and deflection angle


2.3.4 Acceleration 

The acceleration of an aircraft consists of horizontal acceleration and vertical
acceleration. Figure 4 shows a scatter plot of vertical acceleration and fuel
consumption sampled during UAV missions.


X: Vertical acceleration, Y: Fuel consumption


Fig. 4. Scatter plot of vertical acceleration and fuel consumption


2.4 Weight 


The weight of a UAV is the sum of the fuselage weight and the load weight. The heavier
the UAV, the greater the thrust needed, and accordingly the greater the fuel
consumption. During a mission, the UAV weight decreases as fuel is consumed. In the fuel
consumption calculation, therefore, the flight process should be divided into several
stages, with the weight reduction within each stage neglected; this simplifies the
calculation while keeping the calculation error small.


2.5 Atmospheric Parameters 


Atmospheric parameters mainly include temperature, density, pressure, etc. These
parameters are nearly uniform in the horizontal direction but vary considerably in the
vertical direction. Temperature and pressure are the main factors influencing fuel
consumption; in the aviation field, the standard atmosphere is used for simulation.
3 Computational Model for Fuel Consumption


Among the factors influencing fuel consumption discussed in Sect. 2, the fundamental
aerodynamic parameters are complex variables: first, they differ between UAV types;
second, the correlations between different aerodynamic parameters can hardly be
described by functions. Hence, aerodynamic parameters are not considered in the fuel
consumption modeling in this paper.

Because engine performance parameters are linearly related to the TSFC (which depends
on speed v, altitude h and thrust F), engine performance parameters can be used to model
fuel consumption.

Flight trajectory parameters relate most directly to fuel consumption. Speed, altitude,
pitch angle, roll angle, etc. can all be used to model fuel consumption, so we include
these parameters in the fuel consumption model.

Since the fuselage weight and the load weight of a UAV can be acquired accurately and
calculated easily, we also include the total weight of the UAV as a parameter in the
fuel consumption model.

Normally, atmospheric parameters have a certain influence on fuel consumption. However,
because the altitude changes little in level flight, atmospheric parameters have little
effect on the fuel consumption model; hence atmospheric factors are not considered in
establishing the model.

Before modeling fuel consumption, the following assumptions are made based on the
principle of energy balance:


1. The weight change caused by fuel reduction during the UAV mission does not affect the
fuel consumption calculation;

2. The accelerated motion of the UAV is treated as uniformly accelerated motion, i.e.
the variations of speed and altitude are linear;

3. All energy generated from fuel consumption is used for UAV motion; the energy supply
for other components is ignored. In fact, energy for other components is provided by
the battery in the UAV;

4. A complete UAV mission includes 5 stages, i.e. taxiing, climbing, cruising, landing
and taxiing. Though the fuel consumption in each stage is different, taxiing, climbing
and landing make up small proportions of the mission time, so only fuel consumption
in cruising is taken into account when modeling.


The fuel consumption model of the UAV in the vertical plane, based on the principle of
energy balance, is shown in Fig. 5.


Fig. 5. UAV fuel consumption model on vertical surface


Let the altitude be $h$, the speed be $v$, the acceleration of gravity be $g$, the UAV
mass be $m_1$, the load mass be $m_2$, the thrust generated by the engine be $T$, the
aerodynamic drag be $f$, the lift be $L$, the pitch be $\alpha$, the roll angle be
$\gamma$, and the abscissa and ordinate of the UAV be $x$ and $y$. Writing
$m = m_1 + m_2$ for the total mass, the following set of equations can be set up:

$$m\frac{dv}{dt} = T\cos\alpha - f - (m_1 + m_2)\,g\sin\gamma \tag{1}$$

$$m v\frac{d\gamma}{dt} = T\sin\alpha + L - (m_1 + m_2)\,g\cos\gamma \tag{2}$$

$$\frac{dh}{dt} = v\sin\gamma \tag{3}$$

$$\frac{dx}{dt} = v\cos\gamma \tag{4}$$

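The energy height of the UAV is as follows:

$$E = h + \frac{v^2}{2g} \tag{5}$$


The total energy equals the product of the energy height and the UAV weight, thus the
state model based on the principle of energy balance can be obtained as follows:

$$\begin{cases} E = h + \dfrac{v^2}{2g} \\[2mm] \dfrac{dE}{dt} = \dfrac{v\,(T\cos\alpha - f)}{(m_1 + m_2)\,g} \\[2mm] \dfrac{dx}{dt} = v\cos\gamma \end{cases} \tag{6}$$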
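
The energy-rate line of the state model can be checked by differentiating the energy height $E = h + v^2/(2g)$ and substituting the equations of motion. The short derivation below is a sketch that assumes the usual point-mass relations $\dot h = v\sin\gamma$ and $m\dot v = T\cos\alpha - f - mg\sin\gamma$, with $m = m_1 + m_2$ taken as the total mass (a notational assumption):

$$\frac{dE}{dt} = \frac{dh}{dt} + \frac{v}{g}\frac{dv}{dt} = v\sin\gamma + \frac{v}{g}\cdot\frac{T\cos\alpha - f - m g\sin\gamma}{m} = \frac{v\,(T\cos\alpha - f)}{m g}$$

The gravity terms cancel, so the rate of change of the energy height depends only on the excess of the thrust component along the flight direction over the drag.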
The fuel consumption $M_f$ is given by

$$dM_f = C_e P\,dt \tag{7}$$

where $C_e$ is the TSFC of the UAV and $P$ is the rated thrust of the engine.
Therefore, the relationship between fuel consumption and energy height is

$$\frac{dE}{dM_f} = \frac{dE/dt}{dM_f/dt} = \frac{dE/dt}{C_e P} \tag{8}$$

where the TSFC of engines varies according to aircraft type: the TSFC of an airliner is
generally 0.1-0.5; the TSFC of a fighter is generally 0.7-1.5; and the TSFC of the UAV
in this paper is about 0.1. The fuel consumption of a UAV in the cruising stage can be
derived from Eqs. (6) and (8).
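
To illustrate how the energy-height relations and the fuel formula might be evaluated numerically in a simulator, here is a minimal Python sketch. The function names, the constant-thrust cruise assumption and all numeric inputs are hypothetical; units are assumed to be mutually consistent and are not specified by the paper.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def energy_height(h, v, g=G):
    """Energy height E = h + v^2 / (2 g)."""
    return h + v * v / (2.0 * g)


def energy_height_rate(v, thrust, alpha, drag, mass, g=G):
    """dE/dt = v (T cos(alpha) - f) / (m g), with m the total mass."""
    return v * (thrust * math.cos(alpha) - drag) / (mass * g)


def cruise_fuel(tsfc, rated_thrust, duration):
    """Integrate dM_f = C_e * P * dt for constant TSFC and rated thrust."""
    return tsfc * rated_thrust * duration


# Hypothetical cruise segment: TSFC about 0.1 (as quoted for the UAV),
# rated thrust 200 and a duration of 2 time units.
fuel_used = cruise_fuel(0.1, 200.0, 2.0)
rate = energy_height_rate(v=50.0, thrust=200.0, alpha=0.0,
                          drag=100.0, mass=100.0, g=10.0)
```

Dividing `rate` by `tsfc * rated_thrust` gives the energy height gained per unit of fuel, which is the quantity a staged cruise calculation would accumulate.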


4 Conclusion 


This paper first analyzes in detail the factors influencing the fuel consumption of a
UAV and then, on the basis of these factors, presents a fuel consumption model of the
UAV. The presented model can be used to implement fuel consumption simulation in a UAV
simulation control system.
Abstract. This paper studies the simulation of meteorological data. Based on collected
meteorological data, it puts forward a simulation method in which the meteorological
data are classified into discrete data and continuous data, which are simulated
separately. By establishing mathematical models of the meteorological variables and
applying the relevant meteorological knowledge, the simulation of meteorological data is
achieved. The relationships among the simulated meteorological data accord with those
among the real data; in other words, the simulated results approach the real
meteorological data to a certain extent. Moreover, the result supports the application
of unmanned aerial vehicles in logistics, disaster relief, medical care, etc.


Keywords: Meteorological data · Simulation · Mathematical model ·
Discrete


In UAV simulation training, the weather is a very important factor: changes in weather
greatly affect the operation and flight of the UAV in the simulation environment. To
improve the realism of the simulation training system, the simulation of meteorological
data needs to be studied. The objective of Ref. [1] was to develop and validate a
simulation model of the evaporation rate of a Class A evaporimeter pan. In [2], daily
weather data and three climatic variables for four cities in China were gathered,
investigated and analyzed, and the sensitivity of building energy consumption to the
climatic variables was discussed. In [3], an experimental embankment was constructed to
investigate the influence of climatic changes on the soil response, such as changes in
water content and temperature. The main contribution of Ref. [4] is an analysis, from a
meteorological viewpoint, of the requirements for a wind power infeed model used in a
power system simulator. The WARMF model was applied to the Catawba River watershed of
North and South Carolina to simulate flow and water quality in rivers and a series of 11
reservoirs [5]. The data generated in Ref. [6] can be directly applied to the EIA
prediction model and serve EIA. Edgar et al. [7] analyze the performance of the
physically based snow model SNOWPACK in calculating the snow cover evolution with input
data commonly available from automatic weather stations. In [8], a system for observing
meteorological data based on a wireless sensor network is designed to fulfill the
business requirements for meteorological data observation in unattended areas. Based on
collected meteorological data, this paper puts forward a simulation method in which the
meteorological data are classified into discrete data and continuous data, which are
simulated separately. Task deduction and simulated training are conducted in
coordination with the comprehensive task control system, and the result is satisfactory.
1 Meteorological Data Collection and Sorting


Once the way of establishing the statistical model has been determined, appropriate
meteorological data can be collected according to the types of meteorological data to be
simulated. Table 1 shows the daily meteorological data of one of the districts in
Changchun.


Table 1. The daily meteorological data of one of the districts in Changchun


Time  | Weather condition | Temperature (°C) | Relative humidity | Air pressure (hPa) | Wind force | Wind direction
1 am  | Fine   |    |    |     | 0 | North
2 am  | Fine   |    |    |     | 0 | North
3 am  | Fine   |    |    |     | 2 | West
4 am  | Fine   |    |    |     | 1 | South
5 am  | Fine   |    |    |     | 1 | East
6 am  | Fine   |    |    |     | 0 | North
7 am  | Fine   |    |    |     | 1 | South
8 am  | Fine   | 21 | 56 | 991 | 2 | South
9 am  | Fine   | 23 |    |     | 1 | South
10 am | Fine   |    |    |     | 2 | Southwest
11 am | Fine   |    |    |     | 2 | West
12 am | Fine   |    |    |     | 0 | North
13 pm | Fine   |    |    |     | 2 | Northwest
14 pm | Fine   |    |    |     | 1 | West
15 pm | Cloudy |    |    |     | 1 | Northwest
16 pm | Cloudy |    |    |     | 0 | North
17 pm | Fine   |    |    |     | 0 | North
18 pm | Fine   |    |    |     | 0 | North
19 pm | Fine   |    |    |     | 0 | North
20 pm | Fine   |    |    |     | 0 | North
21 pm | Fine   |    |    |     | 0 | North
22 pm | Fine   |    |    |     | 0 | North
23 pm | Fine   |    |    |     | 0 | North
24 pm | Fine   |    |    |     | 0 | North


2 Initial Meteorological Data Analysis 


Meteorological data mainly consist of six types of variable: weather condition,
temperature, relative humidity, air pressure, wind force and wind direction. Based on
meteorological knowledge, an initial analysis is conducted on the types of the variables
and their value ranges. Weather condition generally covers the sky condition, rainfall
and snowfall, and is usually described as fine, rain, fog or snow. In meteorology, the
physical quantity that measures how hot or cold the air is is called air temperature; it
is measured here in degrees Celsius. Relative humidity is the ratio of the absolute
humidity to the saturated absolute humidity at the same air temperature, expressed as a
percentage. Air pressure is the hydrostatic pressure generated by the air at a specific
point. It arises from the weight of the atmosphere and is the atmospheric force per unit
area; its unit of measurement is hPa. Meteorology defines the direction from which the
wind blows as the wind direction, which normally falls into sixteen directions. Wind
speed is usually represented as a wind level, which is determined by the degree of
effect that the wind has on ground objects; in meteorology, wind force is classified
into thirteen levels. Two kinds of variables can be distinguished from this initial
analysis. The first kind includes weather condition, wind force and wind direction;
their values belong to fixed sets, so they are defined as discrete variables. The second
kind includes temperature, relative humidity and air pressure; they are numeric
variables, defined as continuous variables.


3 Simulation of the Discrete Data


A Markov process is a process x(t) in which every state transition depends only on the
state at the previous moment and not on earlier states. In other words, the state
transfer process has no after-effect; such a state transfer process is called a Markov
process. A Markov chain is a Markov process that is discrete in both time and state.

A Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$. The range of
these variables, called the state space, is the set of all their possible values, and
the value of $X_n$ represents the state at time $n$. If the conditional probability
distribution of $X_{n+1}$ given the past states depends only on $X_n$, then

$$P(X_{n+1} = x \mid X_0, X_1, X_2, \ldots, X_n) = P(X_{n+1} = x \mid X_n)$$

This identity is the defining characteristic of the Markov chain. In a Markov chain,
the n-step transition probability has the following property:

$$P_{ij}^{(n)} = \sum_{k \in I} P_{ik}^{(l)}\,P_{kj}^{(n-l)}$$

where $I$ is the state space. This is the Chapman-Kolmogorov equation, which gives the
relationship between the n-step transition probability and the one-step transition
probability. Its matrix form is $P^{(k+m)}(n) = P^{(k)}(n)\,P^{(m)}(n+k)$.

We then take the weather condition as an example to illustrate the process of data
generation. First, we assume that the weather condition at time a + 1 depends only on
the weather condition at time a and not on any earlier weather condition. This process
is consistent with a Markov process, so a Markov chain can be used to calculate the
transition probabilities.


Research on Meteorological Data Simulation 475 


3.1 Generate Initial Value 


From the statistics of the meteorological data, the frequency of each kind of weather
condition can be counted and its probability calculated; the results are shown in
Table 2.


Table 2. The frequency and probability of each kind of weather condition


Weather condition | Fine  | Partly cloudy | Cloudy | Overcast | Rainy
Frequency         | 83    | 20            | 7      | 5        | 3
Probability       | 69.1% | 16.6%         | 5.8%   | 4.1%     | 2.1%


Divide the value range of the random number into intervals according to these
probabilities and then generate a random number p. The interval into which p falls
determines the initial value of the weather condition. For example, the random number
range [1, 100] can be divided into fine [1, 69], partly cloudy [70, 86], cloudy
[87, 93], overcast [94, 97] and rainy [98, 100]. A generated random number of 19 then
means that the initial weather condition is fine.
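
The interval lookup described in this subsection can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: it normalizes the observed frequencies itself and uses a uniform number in [0, 1) rather than an integer in [1, 100]; the names `STATES`, `FREQUENCIES` and `sample_state` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

STATES = ["fine", "partly cloudy", "cloudy", "overcast", "rainy"]
FREQUENCIES = [83, 20, 7, 5, 3]  # observed counts from Table 2


def sample_state(u, states=STATES, weights=FREQUENCIES):
    """Map a uniform number u in [0, 1) to a state by cumulative probability."""
    total = sum(weights)
    cumulative = [c / total for c in accumulate(weights)]
    index = bisect_right(cumulative, u)
    # Guard against rounding at the upper end of the last interval.
    return states[min(index, len(states) - 1)]


rng = random.Random(42)           # fixed seed only for repeatability
initial_weather = sample_state(rng.random())
```

Passing the uniform number in as an argument keeps the lookup deterministic and easy to check against the interval boundaries.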





3.2 Calculate Probability Transition Matrix 


The statistics of the meteorological data also give the transition frequencies, which
form the matrix F. With the states ordered as fine (clear), partly cloudy, cloudy,
overcast and rainy, the element $f_{ij}$ is the number of observed transitions from
state $i$ to state $j$:

$$F = (f_{ij}) = \begin{pmatrix} 74 & 6 & 3 & 0 & 0 \\ 3 & 14 & 1 & 1 & 1 \\ 0 & 4 & 3 & 0 & 1 \\ 0 & 0 & 2 & 3 & 0 \\ 0 & 0 & 2 & 0 & 1 \end{pmatrix}$$

Then the transition probability matrix P is obtained from F by dividing each row by its
row sum:

$$P = \begin{pmatrix} 0.89 & 0.07 & 0.04 & 0 & 0 \\ 0.15 & 0.7 & 0.05 & 0.05 & 0.05 \\ 0 & 0.5 & 0.375 & 0 & 0.125 \\ 0 & 0 & 0.4 & 0.6 & 0 \\ 0 & 0 & 0.66 & 0 & 0.34 \end{pmatrix}$$

By employing the Chapman-Kolmogorov equation, we can calculate the n-step transition
probability matrices $P^{(2)}, P^{(3)}, \ldots, P^{(n)}$.
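
The step from transition counts to transition probabilities, and on to n-step matrices, can be sketched as follows. The sketch assumes a time-homogeneous chain, so that the n-step matrix is simply the n-th power of the one-step matrix; the helper names are hypothetical.

```python
def row_normalize(freq):
    """Turn a matrix of transition counts into row-stochastic probabilities."""
    return [[count / sum(row) for count in row] for row in freq]


def mat_mul(a, b):
    """Plain matrix product of two square lists of lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def n_step(p, n):
    """n-step transition matrix of a time-homogeneous chain (n >= 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


# Transition counts; states: fine, partly cloudy, cloudy, overcast, rainy.
F = [
    [74, 6, 3, 0, 0],
    [3, 14, 1, 1, 1],
    [0, 4, 3, 0, 1],
    [0, 0, 2, 3, 0],
    [0, 0, 2, 0, 1],
]
P = row_normalize(F)
P2 = n_step(P, 2)
```

Working from the counts rather than the rounded probabilities keeps every row of `P` summing to one, which the later sampling step relies on.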


476 Z. Wu et al. 


3.3 Generate Data from Transition Probability 


The weather condition at time n is generated in the same way as the initial value.
Divide the value range of the random number into intervals according to the transition
probabilities in the row of matrix $P^{(n)}$ that corresponds to the initial value, then
generate a random number p and find the interval into which it falls. That interval
determines the weather condition at time n.
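
As a hedged illustration of generating a whole sequence of weather states, the sketch below advances the chain one step at a time using the one-step matrix, which is a common variant of the n-step lookup described above rather than the exact procedure of the paper. The function name `simulate_chain` and the seed are hypothetical; the matrix values are the rounded probabilities quoted in Sect. 3.2.

```python
import random

# One-step transition probabilities; states: 0 fine, 1 partly cloudy,
# 2 cloudy, 3 overcast, 4 rainy.
P = [
    [0.89, 0.07, 0.04, 0.0, 0.0],
    [0.15, 0.7, 0.05, 0.05, 0.05],
    [0.0, 0.5, 0.375, 0.0, 0.125],
    [0.0, 0.0, 0.4, 0.6, 0.0],
    [0.0, 0.0, 0.66, 0.0, 0.34],
]


def simulate_chain(p, start, steps, rng):
    """Return the visited state indices, beginning with the start state."""
    states = [start]
    for _ in range(steps):
        row = p[states[-1]]
        # Draw the next state with probability proportional to the row entries.
        states.append(rng.choices(range(len(row)), weights=row)[0])
    return states


hourly_states = simulate_chain(P, start=0, steps=23, rng=random.Random(7))
```

Starting from the sampled initial value and taking 23 steps yields one state per hour of a simulated day.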


4 Continuous Data Simulation 


Linear regression analysis is a common method of establishing meteorological
application models. According to meteorological knowledge, relationships exist among
many of the meteorological variables. Moreover, the theory of linear regression analysis
is thoroughly developed, from model establishment to validation methods. Meteorological
data can be forecast and analyzed by employing this model.





Fig. 1. Temperature and time scatter figure


For continuous variables, the existing meteorological data can be used to conduct
multiple linear regression analysis and establish the mathematical relationships among
the meteorological data. The generated variables are then substituted into the
regression equation to generate the continuous variables. The following temperature
example introduces the process of generating continuous variables by multiple linear
regression analysis. After screening the factors, time and weather condition are chosen
as the independent variables in the regression analysis of temperature; the scatter
diagram of time and temperature is shown in Fig. 1.

From Fig. 1, temperature decreases steadily from 2 pm to 2 am, whereas it increases
from 3 am to 1 pm. Thus, it is necessary to conduct multiple linear regression analysis
on these two time periods separately.
To quantify the factors, numerical values must be assigned to the weather conditions;
the result is shown in Table 3.


Table 3. Meteorological quantification table


Weather condition | Fine | Partly cloudy | Cloudy | Overcast | Rainy
Value             |      |               |        |          |

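The following describes the process of establishing the regression model of
temperature from 2 pm to 2 am:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, 3, \ldots, n$$

where $n$ is the number of observations and the $\varepsilon_i$ are mutually independent
and follow the normal distribution $N(0, \sigma^2)$.
The 65 records can be written as:

$$\begin{cases} y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \varepsilon_1 \\ y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \varepsilon_2 \\ \quad\vdots \\ y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \varepsilon_n \end{cases}$$

In matrix form, this set of equations becomes $Y = X\beta + \varepsilon$, where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{65} \end{pmatrix} = \begin{pmatrix} 28 \\ 28 \\ \vdots \\ 15 \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} \\ 1 & x_{2,1} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{65,1} & x_{65,2} \end{pmatrix} = \begin{pmatrix} 1 & 14 & 1 \\ 1 & 15 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 26 & 1 \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{65} \end{pmatrix}$$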

For the equations above, the least squares method can be used to find the solution.
Its fundamental principle is to find $b_0, b_1, \ldots, b_m$ that minimize the residual
sum of squares between the observations and the regression values. This is equivalent
to solving:

$$\frac{\partial}{\partial\beta}(Y - X\beta)'(Y - X\beta) = \frac{\partial}{\partial\beta}\left(Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta\right) = \frac{\partial}{\partial\beta}\left(Y'Y - 2Y'X\beta + \beta'X'X\beta\right) = -2X'Y + 2X'X\beta = 0$$

$$\Rightarrow\; X'Y = X'X\beta \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'Y$$

We obtain the estimates $b_0, b_1, b_2$ of $\beta_0, \beta_1, \beta_2$, which are 42.9,
-1.08 and -0.85. Therefore, the regression equation of temperature is
$\hat{y} = 42.9 - 1.08x_1 - 0.85x_2$.
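
The normal-equation solution can be reproduced with a few lines of plain Python. The sketch below is illustrative only: the helper names are hypothetical, and because the 65 original records are not available here, it fits synthetic, noise-free data generated from the reported equation (time from 14 to 26 and hypothetical weather codes) to show that the coefficients are recovered.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least squares via the normal equations (X'X) b = X'Y."""
    X = [[1.0, *r] for r in rows]          # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return solve_linear(xtx, xty)


# Synthetic illustration data (hypothetical weather codes, no noise).
times = list(range(14, 27))
weather = [1, 2, 1, 3, 1, 1, 2, 5, 1, 4, 1, 2, 1]
rows = list(zip(times, weather))
y = [42.9 - 1.08 * t - 0.85 * w for t, w in rows]
coef = fit_ols(rows, y)
```

With real observations, `rows` and `y` would hold the recorded times, weather codes and temperatures, and `coef` would contain the estimates of the three regression coefficients.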
The significance of the regression equation must then be tested.
4.1 Certainty Coefficient of the Equation


The certainty coefficient (coefficient of determination) of the equation represents the
degree to which the independent variables explain the dependent variable. The certainty
coefficients of the regression equation are shown in Table 4:


Table 4. The certainty coefficients of the regression equation


Item  | R     | R²    | Adjusted R²
Value | 0.905 | 0.819 | 0.813


From Table 4, R is 0.905, R² is 0.819 and the adjusted R² is 0.813; the adjusted R²
should be considered first. Since 0.813 is close to 1, the independent variables explain
the dependent variable to a high degree.


4.2 Significance Test of the Linear Relationship of the Regression 
Equation 


Variance analysis decomposes the sum of squares of deviations and its degrees of
freedom, and examines the linear relationship between the independent and dependent
variables by means of the statistic F. The variance analysis is shown in Table 5:


Table 5. The variance analysis table


Source of variance | Sum of squares | Degrees of freedom | F value
Regression         | 1037.914       | 2                  | 140.536
Residual           | 228.947        |                    |
Total              | 1266.862       |                    |


Assume:

$$H_0: \beta_1 = \beta_2 = 0, \qquad H_1: \text{at least one of } \beta_1, \beta_2 \text{ is nonzero}$$

We set the significance level $\alpha = 0.05$ and find the rejection region. Referring
to the table, $F_{0.05}(2, 60) = 3.15$, so

$$F = 140.536 > F_{0.05}(2, 60) > F_{0.05}(2, 62)$$

We therefore reject $H_0$ and accept $H_1$: the linear relationship between the
independent variables and the dependent variable is significant.
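
The reported statistics can be cross-checked from the sums of squares. The sketch below assumes the standard definitions of R², adjusted R² and the F statistic (the text does not state which formulas were used) with n = 65 records and m = 2 independent variables; the function name is hypothetical.

```python
def regression_summary(ss_regression, ss_residual, ss_total, n, m):
    """Return (R^2, adjusted R^2, F) from the variance decomposition."""
    r2 = ss_regression / ss_total
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)
    f_value = (ss_regression / m) / (ss_residual / (n - m - 1))
    return r2, adjusted_r2, f_value


# Sums of squares as listed in the variance analysis table.
r2, adjusted_r2, f_value = regression_summary(
    ss_regression=1037.914, ss_residual=228.947, ss_total=1266.862, n=65, m=2
)
```

Under these assumptions the three values agree, after rounding, with the 0.819, 0.813 and 140.536 listed in the tables.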
4.3 Significance Test of the Regression Coefficient of the Regression
Equations


The purpose of this significance test is to determine whether the linear relationship
between each independent variable and the dependent variable is significant, by testing
whether the regression coefficient differs significantly from 0.

t-test on the regression coefficients:

Given the level of significance $\alpha = 0.05$, calculate the value of the test
statistic t:

$$t = \frac{\text{estimate of the regression coefficient}}{\text{standard deviation of the regression coefficient}}$$


ti = —16.698, to = SAT I. 

$t_{\alpha/2}(n - m - 1) = t_{\alpha/2}(62) = 1.999$, $|t_1| > t_{\alpha/2}(62)$ and
$|t_2| > t_{\alpha/2}(62)$. The null hypothesis is rejected: the linear relationship
between each independent variable and the explained variable is significant, so both
variables should remain in the equation.


4.4 Introduce Normal Distribution Error 


The result generated from the formula is obviously an ideal value. The simulated
meteorological data would take only a few distinct values, since fixed values of the
independent variables always produce the same value of the dependent variable. To bring
the data closer to real values, a normally distributed random error is introduced once
the value of the variable has been generated. The Box-Muller method is a widely used,
simple and efficient way of generating normally distributed random numbers. It requires
two uniformly distributed random numbers $n_1$, $n_2$ in the range (0, 1):

$N = \sqrt{-2 \ln(n_1)} \cos(2 \pi n_2)$

This generates a random number N with the standard normal distribution N(0, 1). A
normally distributed random number with mean a and standard deviation sd is then
obtained by the transformation X = a + (N * sd).
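A minimal sketch of this step is given below. It implements the Box-Muller formula and the transformation X = a + (N * sd) as stated above; the function names and the optional rng argument are illustrative choices, not part of the original method.

```python
import math
import random


def box_muller(rng=random):
    """Return one N(0, 1) sample built from two uniform random numbers."""
    n1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(n1) is defined
    n2 = rng.random()
    return math.sqrt(-2.0 * math.log(n1)) * math.cos(2.0 * math.pi * n2)


def add_normal_error(a, sd, rng=random):
    """Return X = a + N * sd, a sample with mean a and standard deviation sd."""
    return a + box_muller(rng) * sd
```

For example, add_normal_error(20.0, 2.0) perturbs an ideal value of 20 with a normally distributed error of standard deviation 2.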


5 Conclusion 


This paper focuses on a simulation method for meteorological data. According to their
characteristics, the meteorological data are classified into discrete and continuous
variables, which are simulated separately. The paper begins by collecting and analyzing
the data. Then, a Markov model is introduced to generate the discrete data. Finally, a
multiple regression model is used to analyze the generated variables and to process the
continuous variables. This paper provides a reliable data simulation method for
applications of UAVs in logistics, disaster relief and medical care.




References 


1. Molina Martinez, J.M., Martinez, A.V., Gonzalez-Real, M.M., Baille, A.: A simulation model 
for predicting hourly pan evaporation from meteorological data. J. Hydrol. 318(1), 250-261 
(2006) 

2. Gao, Q., Liu, J., Yang, L.: Sensitivity studies on elements of meteorological data for building 
energy simulation in China. In: International Building Performance Simulation Association, 
pp. 217-222 (2007) 

3. Cui, Y.J., Gao, Y.B., Ferber, V.: Simulating the water content and temperature changes in an 
experimental embankment using meteorological data. Eng. Geol. 114(3), 456-471 (2010)

4. Brose, N.: Specification of meteorological data requirements for a wind power infeed model 
used in power system simulator. In: 12th International Conference on Environment and 
Electrical Engineering, pp. 140-144 (2013) 

5. Herr, J.W., Vijayaraghavan, K., Knipping, E.: Comparison of measured and MM5 modeled 
meteorology data for simulating flow in a mountain watershed. J. Am. Water Resour. Assoc. 
46(6), 1255-1263 (2010) 

6. Wang, Q., Li, S., Ding, F., Zhao, X.: Simulation of high-altitude meteorological data used to 
environment impact assessment by MM5 model. Procedia Environ. Sci. 2, 1713-1716 (2010)

7. Edgar, S., Christoph, M., Charles, F., Michael, L.: Evaluation of modelled snow depth and 
snow water equivalent at three contrasting sites in Switzerland using SNOWPACK simulations 
driven by different meteorological data input. Cold Reg. Sci. Technol. 99, 27-37 (2014)

8. Reza, A., Zhiliang, Z., Shuang, Z.: Design and simulation of a meteorological data monitoring 
system based on a wireless sensor. Int. J. Online Eng. 12(5), 27-32 (2016)


Anti-data Mining on Group Privacy Information 


Fan Yang¹, Tian Tian¹, Hong Yao¹, Xiuyu Zhao¹, Tinggang Zheng¹, and Min Ning²


¹ Computer School, China University of Geosciences, Wuhan 430074, China
planesail@163.com, {tiantian, yaohong}@cug.edu.cn,
2829936010@qq.com, 2289067224@qq.com
² Founder International Software (Beijing) Co., Ltd., Beijing 100080, China
ningmin@founder.com


Abstract. In the big data era, privacy preserving is a vital security challenge for
data mining. The common object of privacy preserving is personal privacy, which
should remain unrevealed while group information is mined. However, for a few
sensitive groups, such as people suffering from a particular disease, engaged in a
special occupation or having a peculiar hobby, group specificity can still be exposed
even if every personal record is processed for privacy preserving. Therefore, we
propose the concept and method of anti-data mining on group privacy information.
By adding and swapping data according to our rules, the minable characteristics and
group specificity of the original data are destroyed, which prevents data mining from
revealing group privacy.


Keywords: Anti-data mining · Group privacy information ·
Privacy preserving · Group specificity


1 Introduction 


The concept of “big data” has appeared in physics, biology, environmental ecology,
finance and communication for several years. In 2011, the well-known consulting
company McKinsey was the first to predict that “the era of big data” was coming [1].
Big data is a term for data sets that are so large or complex that traditional
data processing application software is inadequate to deal with them [2]. The lifecycle
of big data includes data extraction and integration, data analysis and interpretation,
among which data analysis is the core [3]. Unlike conventional data, big data
possesses the “4V” features: Volume, Velocity, Variety and Value [4].
Consequently, the common ways of data analysis are no longer suitable for big data.

Data mining is defined as the procedure of extracting useful knowledge from the vast
data stored in databases [5]. The patterns and features contained in big data are so
valuable that almost all sectors, such as enterprises, telecom operators and
governments, are engaged in data mining. However, science and technology can be a
double-edged sword. Data mining on big data brings both a mass of valuable information
and a huge risk of privacy leakage, because data mining algorithms must collect abundant
user data over a long period to infer the behavioral habits behind it [6].

The concept of privacy preserving data mining (PPDM for short) was first presented
in [7] to resolve the conflict between the precise excavation of knowledge rules and
the privacy protection of the original information. Lindell indicated the need to protect
data privacy while ensuring data accuracy [8]. Evfimeievski used randomization to
establish an accurate data mining model for aggregated data [9]. In 2004, Vaidya
introduced how data mining should be changed to accommodate the attacks of privacy
advocates [10]. To solve the problem of constructing aggregated data without accurate
data, Zhang Nan described related technologies such as reconstruction [11]. By 2005,
Vaidya had proposed a relatively systematic idea and solution in [12], including a variety
of corresponding models. Lindell focused on the decision tree algorithm, especially the
ID3 algorithm [13]. Most recently, Saranya conducted an extensive survey of different
PPDM algorithms and analyzed the representative technologies [14]. In addition, since
PPDM is a highly interdisciplinary topic, there has been significant progress for PPDM
in statistics, machine learning, etc. [15-17].

The common object of PPDM is personal privacy, which should remain unrevealed
while group information is mined. But for a few sensitive groups, such as people
suffering from a particular disease, engaged in a special occupation or having a
peculiar hobby, group specificity can still be exposed even if every personal record is
processed for privacy preserving. Group privacy refers to private information that is
shared inside the group but that members are unwilling to reveal to those outside it.
Anti-data mining means destroying the minable characteristics of the raw data and
invalidating data mining, in order to protect the covert information contained in big data.

The rest of this paper is organized as follows. Section 2 lists the technological
classification of PPDM and compares existing work on corporate privacy and anti-data
mining with ours. Section 3 explains our concrete algorithms. Section 4 conducts a series
of experiments to validate these algorithms. Section 5 draws a conclusion.


2 Background and Related Work 


2.1 Classification of PPDM 


According to the mainstream technologies for transforming the raw data, PPDM can
be classified along five dimensions [18]:


(i) Data distribution
(ii) Data modification
(iii) Data mining algorithms
(iv) Data or rule hiding
(v) Privacy preservation


The first dimension, data distribution, distinguishes centralized and distributed data
scenarios [19]. Distributed data scenarios can be further divided into horizontal data
distribution [20-22] and vertical data distribution [23-25]. Generally, the raw data
needs to be modified before being released to the public, to protect privacy. The second
dimension, data modification, is divided into perturbation [26, 27], blocking [28],
aggregation/merging, swapping [29] and sampling [30]. The third dimension refers to
the data mining algorithms, among which the relatively important ones are decision tree
inducers [31], association rule mining [32, 33], clustering algorithms [34], rough sets
[35] and Bayesian networks [36]. The fourth dimension refers to whether raw data or
aggregated data should be hidden. To achieve higher utility while protecting privacy,
selective modification is essential. The last dimension, the privacy preservation
techniques used to modify the data, includes heuristic-based techniques [37],
cryptography-based techniques [38], reconstruction-based techniques [39] and so on.


2.2 Group Privacy and Anti-data Mining 


In China, a piece of news raised wide concern in July 2016: 275 HIV-infected patients
from 30 provinces reported receiving fraud calls after a leak from an authoritative AIDS
database [40]. For this special group, any sensitive item revealed from the database may
label members as HIV or AIDS patients, probably resulting in discrimination, job loss or
even suicide. Even if individual data is processed for privacy preservation, messages
from group members still carry the group specificity. Once the mass of fragments is
excavated and pieced together into integrated information, the group privacy will be
exposed in the end. Consequently, protecting the individual privacy of group members is
far from enough. For this reason, we propose the protection of group privacy information.

Traditional PPDM learns the group pattern while keeping individual privacy secret
through limited data mining [41], whereas our anti-data mining shields the group
specificity from data mining to protect group privacy. The “corporate privacy” mentioned
in [42] protects data from distributed sources by secure multiparty computation instead
of a trusted third party, whereas we centrally pre-process the raw data to hold back data
mining on group privacy. Different from the anti-data mining in [43, 44], where noise
data is added to personal information, our method is applied to group attributes.


3 Anti-data Mining on Group Privacy 


3.1 Scenario Description 


We present anti-data mining on group privacy to solve the dilemma in the following
scenario. There are many active social networks based on a common hobby, belief or
other shared trait. For example, there is a network community whose members are all AIDS
patients, which meets the social needs of this special group. But this community is quite
vigilant toward outsiders, because the real world still discriminates against its members.
Existing data mining tools may collect Google search records to find out the keyword
(AIDS) of this community. Hence the requirement to protect the common feature of this
group. Our work pre-processes the search records so that both needs, acquiring
information and protecting group privacy, are met.


3.2 Method Feasibility


Group specificity distinguishes group members from other random members, implying
the existence of a special group and of a common privacy. Once we lessen or eliminate
the specificity, the group privacy can be protected. On the premise of not obstructing
regular communication with the outside, we pre-process the data issued by group
members. The concrete methods are adding and swapping according to some rules,
so that the data can no longer be clustered as before. In the end, the transformed search
records lose their original group specificity completely. The closer the results are to
the search records of random users, the better the effect.

Our model acts as a search engine agent between the group and the outside Internet:
it pre-processes the search requests from group individuals and returns the results
through the converse process. The pre-processing aims to convert search records
with obvious group specificity into records so random that data mining on them
cannot recover the group characteristics about AIDS. Meanwhile, the effect of information
retrieval remains unaffected, i.e., the converse process can filter the rough search
results from a seemingly random result pool into the information about AIDS that
interests the group individuals.


3.3 Relevant Definition 


Clustering is a division of data into groups of similar objects. Each group, called a
cluster, consists of objects that are similar to one another and dissimilar to objects of
other groups [45]. The similarity is measured by the distances between the described
objects. As the search record is the main object of our study, clustering complex text
requires a nonmetric similarity function, so we introduce the Jaccard index. Given two
sets A and B, the Jaccard index is defined as the ratio of the size of their intersection
to the size of their union.


Definition 1. Point of Searching Record. A point of searching record is the set of
search keywords that an individual of the special group submits on the Internet, denoted
PSR as below. There are n searching records in a PSR, each denoted $r_i$.


$PSR = \{r_1, r_2, \ldots, r_n\}$   (1)


Definition 2. Point Similarity. The point similarity of two PSRs is defined as the Jaccard
index of the two sets. For sets A and B,


$PS(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$   (2)
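As a minimal illustration of Eq. (2), the point similarity of two keyword sets can be computed as follows. The function name and the handling of two empty records are assumptions of this sketch.

```python
def point_similarity(a, b):
    """Jaccard index of two keyword collections, following Eq. (2)."""
    a, b = set(a), set(b)
    common = len(a & b)
    union_size = len(a) + len(b) - common
    if union_size == 0:
        return 0.0  # assumption: two empty records are treated as dissimilar
    return common / union_size
```

For instance, the records {HIV, sex, incubation period} and {HIV, sex, weather} share two of four distinct keywords, so their point similarity is 0.5.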


Definition 3. Clustering Score. Clustering score indicates the proportion of special 
individuals in clustering result vs the random distribution. Suppose that n is the number 
of target records in cluster after clustering, s is the size of result cluster, r, is the ratio of 
total target record in total searching records. We use the following equation to normalize 
the clustering score into the interval of [-1,1]. 


Definition 4. Average Clustering Score. The average clustering score is the average
of the absolute values of the clustering scores over all result clusters. Assume that
there are m final clusters; then ACS is defined as below.


$ACS = \frac{|CS_1| + |CS_2| + \cdots + |CS_m|}{m}$   (4)
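Eq. (4) amounts to a mean of absolute values. The sketch below takes the per-cluster scores CS_i as given inputs and does not attempt to compute them; the function name is illustrative.

```python
def average_clustering_score(cluster_scores):
    """Mean of |CS_i| over the m result clusters, following Eq. (4)."""
    if not cluster_scores:
        raise ValueError("at least one cluster score is required")
    return sum(abs(cs) for cs in cluster_scores) / len(cluster_scores)
```

With one cluster made up entirely of targets (CS = 1) and four clusters without any target (CS = -1), the ACS is 1, the most uneven case. Scores close to 0 in every cluster give an ACS close to 0.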


3.4 Algorithm Description 


Unlike common data mining, anti-data mining makes the clustering result unapparent
through a variation process. Our method follows the general approach of data mining
except for the variation step, which is the core of group privacy preservation. Under
different clustering algorithms, patients of the same kind tend to gather unevenly into
one or a few clusters. We grade the clustering result, rank the clusters and vary the
records, after which the result is graded again to judge whether the mixing is uniform.
If not, we loop to vary and cluster again; otherwise we output the result. In the ideal
condition, the target patients spread evenly over all result clusters, indicating that
data mining on group privacy is invalid.
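The control flow described above can be sketched as a simple loop. This is a skeleton under stated assumptions rather than the authors' implementation: cluster, grade and vary are caller-supplied placeholders, the threshold of 0.1 mirrors the ending condition used in Experiment 1, and max_generations is an added safeguard.

```python
def anti_data_mining(records, cluster, grade, vary, threshold=0.1, max_generations=50):
    """Cluster, grade and vary the records until the ACS drops below threshold.

    cluster(records) -> clusters, grade(clusters) -> ACS and
    vary(records, clusters) -> records are hypothetical callables.
    """
    generation = 0
    clusters = cluster(records)
    acs = grade(clusters)
    while acs >= threshold and generation < max_generations:
        records = vary(records, clusters)  # adding and/or swapping keywords
        clusters = cluster(records)        # cluster the varied records again
        acs = grade(clusters)              # grade again to test uniformity
        generation += 1
    return records, clusters, acs, generation
```

Passing the three steps in as functions keeps the loop independent of the clustering algorithm and of the variation rule being tested.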


3.4.1 Clustering 

We use a clustering algorithm to verify the effect of anti-data mining. In the clustering
procedure, the distance between two points is calculated from the proportion of common
entries to total entries, i.e., the Jaccard index in Eq. (2). The central point of a
cluster is a virtual, auxiliary point containing the most frequently used entries, whose
length is the average length of all points inside the cluster. This central point is used
frequently in the following steps.
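One way to build such a central point is sketched below for a non-empty cluster of keyword sets. Rounding the average length to the nearest integer and the arbitrary order among equally frequent keywords are assumptions of this sketch.

```python
from collections import Counter


def central_point(cluster):
    """Return the virtual central point of a non-empty cluster of keyword sets.

    It holds the most frequent keywords of the cluster, and its length is the
    average record length (rounded, which is an assumption of this sketch).
    """
    counts = Counter(keyword for record in cluster for keyword in set(record))
    average_length = round(sum(len(set(record)) for record in cluster) / len(cluster))
    return {keyword for keyword, _ in counts.most_common(average_length)}
```

For the three records {HIV, sex}, {HIV, incubation period} and {HIV, sex, test}, the average length rounds to 2 and the two most frequent keywords are HIV and sex, which form the central point.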


3.4.2 Grading 

Grading measures how well the targets are hidden, expressed by how uniformly they are
distributed over the clusters. Based on the expected distribution of targets in a cluster,
we calculate the deviation coefficient from the expectation. The calculation follows
Eq. (3), where the result CS falls into the interval [-1, 1]. The closer the value is to 1,
the better the clustering effect; a result of 1 means that all records in the cluster are
targets. The closer the value is to -1, the worse the clustering effect; a result of -1
means that no record in the cluster is a target. For anti-data mining, however, the
perfect result is 0, meaning that the distribution of targets after clustering
approximates the random distribution.


3.4.3 Variation

Before variation, we rank the clusters in descending order of the above grades.
The variation consists of two operations: adding and swapping. In adding, we match a
pair of clusters with correspondingly high and low grades each time. Then, for one
cluster of the pair, we randomly select a keyword from the central point of the other
cluster and add it to one of its own real records. If the number of keywords in that
record would exceed the maximum allowable value, the newly added keyword replaces an
existing one instead. By adding uncorrelated noise to change the clustering
characteristics of the original records, the records concentrated in one or a few
clusters can be scattered randomly into as many clusters as possible. In swapping, we
exchange a pair of keywords between two neighboring clusters each time, where the two
keywords come from the central points of the respective clusters. By reducing the
occurrence of high-frequency words, the clustering results can be altered.
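The two operations can be sketched on single records as follows. This is one possible reading of the rules, not the authors' code: the function names, the default maximum of 20 keywords (borrowed from the record length used in the experiments) and the choice of which existing keyword to replace are all assumptions.

```python
import random


def add_keyword(record, other_center, max_keywords=20, rng=random):
    """Adding: insert one keyword taken from the paired cluster's central point.

    If the record is already at max_keywords, a randomly chosen existing
    keyword is replaced instead (one reading of the replacement rule).
    """
    candidates = sorted(set(other_center) - set(record))
    if not candidates:
        return set(record)
    varied = set(record)
    if len(varied) >= max_keywords:
        varied.remove(rng.choice(sorted(varied)))
    varied.add(rng.choice(candidates))
    return varied


def swap_keywords(record_a, record_b, center_a, center_b, rng=random):
    """Swapping: exchange one central-point keyword between two records."""
    from_a = sorted((set(record_a) & set(center_a)) - set(record_b))
    from_b = sorted((set(record_b) & set(center_b)) - set(record_a))
    if not from_a or not from_b:
        return set(record_a), set(record_b)
    key_a, key_b = rng.choice(from_a), rng.choice(from_b)
    return (set(record_a) - {key_a}) | {key_b}, (set(record_b) - {key_b}) | {key_a}
```

Both functions return new sets and leave their inputs unchanged, so a variation step can be retried or compared against the original records.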


4 Experiments 


4.1 Construction of Experimental Data 


At the beginning, two dictionaries are built manually. One, named the target dictionary,
includes 100 entries of AIDS keywords, such as HIV, homosexuality, incubation period,
sex and so on. The other, named the common dictionary, includes 400 entries selected
randomly from everyday vocabulary. Next, we construct ten original clusters, where the
first cluster is all about AIDS and the other nine are non-AIDS. The AIDS cluster
contains 1000 records, where each record consists of one to twenty entries picked from
the target dictionary. The other nine original clusters are built in the same way from
the common dictionary.
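A data set of this shape can be generated along the following lines. The vocabularies here are placeholders (target_0, common_0, ...) standing in for the hand-built dictionaries, and drawing distinct entries for each record is an assumption of this sketch.

```python
import random


def build_records(dictionary, n_records=1000, max_entries=20, rng=random):
    """Build records of 1..max_entries distinct entries drawn from a dictionary."""
    return [set(rng.sample(dictionary, rng.randint(1, max_entries)))
            for _ in range(n_records)]


rng = random.Random(2017)  # fixed seed, for reproducibility only
# Placeholder vocabularies; the real dictionaries hold hand-picked words.
target_dictionary = [f"target_{i}" for i in range(100)]
common_dictionary = [f"common_{i}" for i in range(400)]

aids_cluster = build_records(target_dictionary, rng=rng)
other_clusters = [build_records(common_dictionary, rng=rng) for _ in range(9)]
all_records = aids_cluster + [r for cluster in other_clusters for r in cluster]
```

This yields 1000 target records and 9000 normal records, i.e. the 10000 records that are clustered in the experiments.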


4.2 Disposal of Anti Data Mining 


We run the k-means algorithm on the 10000 records from the above ten original clusters,
where k (k = 5) is the number of result clusters.


Table 1. Clustering condition before variation (adding only). 


Generation | Cluster 5 | ACS
 | 692/692 | 0.871827
 | 668/668 | 0.8769458
 | 690/690 | 0.8682812
 | 1323/0 | 0.8219308
 | 674/6714 | 0.864025
 | 1287/0 | 0.8234368
 | 1086/0 | 0.8157644
 | 669/0 | 0.821806


In Experiment 1, we gather statistics of the running generation, the target distribution
in the 5 result clusters and the Average Clustering Score over eight independent tests,
given the ending condition ACS < 0.1. In this experiment, we test our method with two
ways of variation: adding only vs both adding and swapping. Tables 1 and 2 list the
clustering conditions before and after variation (adding only). Figures 1 and 2
illustrate the target distributions in clusters before and after variation (adding only).


Table 2. Clustering condition after variation (adding only). 


Generation | Cluster 5 | ACS
7 | 1562/140 | 0.075076
3 | 1378/152 | 0.0680522
 | 524/54 | 0.0639792
5 | 887/72 | 0.074742
5 | 2785/245 | 0.0994694
6 | 2488/303 | 0.0866714
5 | 779/173 | 0.086152
6 | 776/94 | 0.0483918


Fig. 1. Distribution of target before variation (legend: non-AIDS, AIDS)

Fig. 2. Distribution of target after variation (legend: non-AIDS, AIDS)


Comparing Tables 1 and 2, before variation the clustering gathers the target records
into one or two clusters, and the ACS of approximately 0.9 indicates that the
distribution is uneven. After variation, every cluster has target records and the ACS
drops below 0.1. As our experimental data includes 1000 target records and 9000 normal
records, an ACS of 0.1 implies a distribution so random that data mining has no effect
on group privacy. We plot the data of the last rows of Tables 1 and 2 as the histograms
of Figs. 1 and 2. From these two figures, we find that before variation the cluster sizes
differ considerably, while after variation the clusters are more similar in size and the
target records are distributed uniformly.

Table 3 lists the clustering condition after variation (both adding and swapping).
Because this variation includes both adding and swapping, the effect is more obvious
than in Table 2, in that both the ACS and the number of running generations are smaller.
In particular, adding only is suitable when the target data is relatively centralized,
early in the variation, while adding and swapping is appropriate when the target data is
relatively uniform, late in the variation.


Table 3. Clustering conditions after variation (Adding and Swapping).


Generation | Cluster 5 | ACS
 | 1596/147 | 0.0639128
 | 536/42 | 0.0830066
 | 1498/115 | 0.0643302
 | 1092/86 | 0.0903424
 | 896/77 | 0.060269
 | 1053/96 | 0.082661
 | 1120/122 | 0.0523848
 | 891/89 | 0.095712


In Experiment 2, we gather statistics of the running generation, the target distribution
in the 5 result clusters and the Average Clustering Score over eight independent tests,
given the ending condition running generation = 10. Again we test two ways of variation:
adding only vs both adding and swapping. Table 4 lists the ACS of adding only vs adding
and swapping. Comparing the two columns, adding and swapping performs better than
adding only, since its ACS is much lower on average.


Table 4. ACS of adding only vs adding and swapping (generation = 10).


ACS (adding only) | ACS (adding and swapping)
0.118620 | 0.071738
0.176082 | 0.058726
0.062426 | 0.084542
0.115515 | 0.028413
0.070112 | 0.102586
0.057862 | 0.052701
0.628860 | 0.107852
0.132182 | 0.055899
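The column averages behind this comparison can be recomputed directly from Table 4. The snippet below copies the eight values of each column; the variable names are illustrative.

```python
from statistics import mean

adding_only = [0.118620, 0.176082, 0.062426, 0.115515,
               0.070112, 0.057862, 0.628860, 0.132182]
adding_and_swapping = [0.071738, 0.058726, 0.084542, 0.028413,
                       0.102586, 0.052701, 0.107852, 0.055899]

mean_adding_only = mean(adding_only)                  # about 0.170
mean_adding_and_swapping = mean(adding_and_swapping)  # about 0.070
```

The average ACS is about 0.170 for adding only and about 0.070 for adding and swapping. The adding only average is pulled up by the single value 0.628860, and in two of the eight tests (the third and fifth rows) adding only has the lower ACS.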


Figure 3 shows the two scatter diagrams of Experiments 1 and 2. From the two variables,
running generation (x-axis) and ACS (y-axis), we obtain two regression curves. Both
experiments fit a power function with a negative exponent: early in the variation, y
decreases rapidly as x increases, and the rate of decrease then slows down continually.
In comparison, the decline becomes gentle when generation >= 5 for adding only, while
the threshold is generation 4 for adding and swapping. Given the same running
generation, adding and swapping attains a smaller ACS. Its curve also falls more steeply
than that of adding only, so the blue curve flattens to nearly horizontal more quickly
than the red one. In conclusion, adding and swapping is better than adding only in
variation effect.


Fig. 3. Scatter diagrams of Experiment 1 and 2


5 Conclusion 


This paper puts forward the concept and realization of anti-data mining on group privacy
information. By transforming the clustering process through adding and swapping, the
group specificity is altered and data mining becomes ineffective. The validity of our
idea is verified by a series of experiments.

In the era of big data, anti-data mining is of practical significance for privacy
preserving and data security. We will research this theme further in the near future.


Abstract. The increasing popularity of online shopping has brought massive numbers of
online consumers and rapid growth of merchandise information data. To meet the
demand for big data processing, an analysis system for e-commerce reviews is built on
the Hadoop software framework. Reviews of Internet commodities are chosen as the
study samples. Naive Bayesian classification is chosen to analyze the data, whose
attribute values are discrete. The classification algorithm is designed in accordance
with the MapReduce parallel computing model and run on the Hadoop platform. A Naive
Bayesian sentiment classifier is constructed and deployed on the Hadoop platform to
carry out the commodity review mining job. The results show that using the Hadoop
distributed platform improves the efficiency of commodity review analysis.


Keywords: Hadoop · MapReduce · Big data · Emotion tendency


1 Introduction 


According to the 2015 Chinese Online Shopping Market Research Report [1], which
CNNIC published in 2016, China’s online sales continue to maintain a high growth
rate. As online shopping becomes more and more popular, it has brought an explosion
of commodity review texts and huge amounts of data, so the analysis of Internet
commodity reviews demands higher efficiency. For automatic text sentiment analysis, a
traditional single machine has limitations. This project takes advantage of the
MapReduce programming model for distributed computing on Hadoop, which, compared with
a single machine, offers more CPU cores and more RAM. Given the large volume of review
text, applying the Hadoop distributed computing framework to review text analysis is of
great significance.


2 Analysis of Review Texts and the Research of Hadoop 


2.1 Analysis of Review Texts 


Text emotion analysis, also called opinion mining, refers to using machine learning,
statistics, natural language processing and other techniques to automatically extract,
define or describe content containing emotional information, in order to classify the
emotional tendency of a whole text or of some subjective part of it [2, 3]. To judge the
emotional tendencies of the reviews, Naive Bayesian is chosen as the classification
method. The Naive Bayesian classification model has the following advantages: (1) the
algorithm theory is simple and the accuracy is high; (2) the classification time is
short; (3) it supports different kinds of data types and the algorithm is stable.

Bayes' theorem is applied as follows. Suppose a purchase experiment E, where X is a
description of the N attribute values of a customer, such as age, gender, annual income
and other statistics, and H represents the purchase hypothesis: H = 0 means purchase
and H = 1 means no purchase. P(H | X) is the probability that H occurs when condition X
is known, i.e., the posterior probability. P(H) and P(X) are the probabilities of
occurrence of events H and X, called the prior (unconditional) probabilities, and
P(X | H) is the probability of X given H. These three probability values can be obtained
from historical data, while the probability that a user purchases a product, i.e.,
P(H | X), needs to be predicted. According to Bayes' theorem, the following equation
holds.


$P(H | X) = \frac{P(X | H) P(H)}{P(X)}$   (1)


The above preliminary knowledge can be used to illustrate a simple Bayesian
classification problem. Let D be a training set of samples, where each row of data
represents a training tuple together with its class label. Each tuple is represented by
an n-dimensional attribute vector $X = \{a_1, a_2, \ldots, a_{n-1}, a_n\}$, where
$a_1, a_2, \ldots, a_n$ are the measured values of the n characteristic attributes
$A_1, A_2, \ldots, A_{n-1}, A_n$. Also assume there are m classes
$C_1, C_2, \ldots, C_{m-1}, C_m$.

Given a tuple vector X, the classifier needs to predict the class $C_i$ that X belongs
to, i.e., the class with the maximum posterior probability $P(C_i | X)$. The
classification prediction for a test tuple X is obtained from Eq. (2): it is the value of
$C_i$ for which $P(X | C_i) P(C_i)$ reaches its maximum.


$class(X) = \arg\max_{C_i} \{P(X | C_i) P(C_i)\} = \arg\max_{C_i} \{P(C_i) \prod_{k=1}^{n} P(a_k | C_i)\}$   (2)
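The decision rule of Eq. (2) can be sketched for discrete attributes as follows. This is a single-machine illustration, not the Hadoop implementation described in this paper: the function names are ours, and the Laplace smoothing (alpha) and the use of log probabilities are additions of the sketch to avoid zero probabilities and numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count classes and attribute values from (attribute_tuple, label) pairs."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, k) -> counts of values of a_k
    values_seen = defaultdict(set)       # k -> distinct values of attribute k
    for attributes, label in samples:
        class_counts[label] += 1
        for k, value in enumerate(attributes):
            value_counts[(label, k)][value] += 1
            values_seen[k].add(value)
    return class_counts, value_counts, values_seen


def classify(x, model, alpha=1.0):
    """Return arg max_i P(C_i) * prod_k P(a_k | C_i), computed in log space."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log P(C_i)
        for k, value in enumerate(x):
            seen = value_counts[(label, k)][value]
            # Laplace smoothing (an addition of this sketch) avoids log(0).
            score += math.log((seen + alpha) / (count + alpha * len(values_seen[k])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training only counts occurrences, which is why the method maps naturally onto a count-and-sum style of parallel computation.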


2.2 Research of Hadoop 


The Hadoop distributed software framework is designed to solve the problem of
large-scale data computing through distributed data storage and distributed computing.
The core of the Hadoop platform is therefore the HDFS distributed storage system
together with the MapReduce distributed data processing model and execution
environment [4].

HDFS automatically divides large files into many blocks and uploads these blocks to the
computer nodes of the same Hadoop cluster. Users only need to access the HDFS root
directory to view all the shared files in the system, without knowing which computer
each part of a file is stored on. At the same time, block storage of large-scale files is
the basis for parallel computing on massive data. The HDFS distributed file system has
the advantages of detecting data node failures, configurable file replication, high
availability, low operating costs and so on. The MapReduce parallel computing model
requires implementing two functions, Map and Reduce, and achieves distributed parallel
computing by “divide and conquer” [5]. In the Map phase, the Map function of a
MapReduce program is dispatched to the machines that store the corresponding parts of
the data and computes on them locally. In the Reduce phase, a number of Reduce tasks are
generated according to the actual data, and they reduce (aggregate) the data output by
the Map phase.
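The division of work between Map and Reduce can be illustrated with a word count over review texts. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are illustrative, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_phase(review):
    """Map: emit a (word, 1) pair for every word of one review."""
    return [(word, 1) for word in review.lower().split()]


def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(reviews):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for review in reviews:  # on a cluster, each data block is mapped on its own node
        for word, count in map_phase(review):
            grouped[word].append(count)  # shuffle: group the values by key
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

For the two reviews "good taste" and "good price", the job returns a count of 2 for "good" and 1 for each of "taste" and "price".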


3 Design and Development of Review Texts Analysis System 


3.1 Architecture Design of Review Texts Analysis System 


In this study, an analysis system for e-commerce review texts is designed on the
distributed storage and parallel computing framework of the Hadoop platform, based on
merchants' needs for automatic acquisition, secure storage and emotional analysis of
product reviews. Built as a web management system with a B/S architecture, the system
is divided into a product review data analysis module running on the Hadoop distributed
cluster, a data storage module and a data display module.

The data analysis module is the core module of the whole system. Its main job is divided
into three parts. The first part is preprocessing and distributed storage of the product
sales and review data. The second part is mining the commodity review data with the
MapReduce parallel computing framework and transferring the processed data to the
database server. The third part is the data migration work. Sqoop is chosen as the
cross-platform data migration tool, as it is good at transporting data from the HDFS
distributed storage system to the database server.


3.2 Design of Review Texts Analysis 


According to the design requirements of the analysis system for e-commerce reviews, an
emotional tendency classifier based on the Naive Bayesian classification algorithm is
realized on the Hadoop platform. The system architecture of the naive Bayesian affective
classifier is shown in Fig. 1; it includes a classifier learning stage and a
classification stage.

Reviews may differ greatly between kinds of products. Therefore, the corpus is combined
to build a product ontology library for the corresponding product.


3.2.1 Construction of Review Texts’ Lexicon 

Major brands of chocolate, including Dove, Ferrero and so on, are selected as the object
of research. Reviews are randomly selected from the corresponding product pages and
analyzed. Based on word frequency statistics, the weights of the review elements are
determined by the analytic hierarchy process, and a table of review elements and weights
is constructed. The results are shown in Table 1.


Fig. 1. System architecture of the naive Bayesian affective classifier


Table 1. Review elements and weight table 


Review elements | Weighted value (N)
Product         |
Price           |
Logistics       |
Public praise   |
Pack            |
Promotion       |
Other           |


At the same time, many review elements are not described directly but through other
related words, so matching on the element names alone would miss them. It is therefore
necessary to establish a classification table for the review elements: when a word in a
review matches a word in the classification table, it is assigned to the corresponding
review element. Table 2 shows the classification table.

Table 2. Review element classification table


Review elements | Related words
Product         | Product, taste, delicious
Price           | Cheap, affordable, cost-effective, expensive
Logistics       | Speed, express
Public praise   | Genuine, fake, come again, just so so
Pack            | Packaging, grade, exquisite
Promotion       | Promotions, discount
Other           | Attitude, service, reply


According to the review element words contained in each review text, the weights of the
review elements are calculated from the weight table of the review elements (Table 1).
The calculation is shown in Eq. 3:


$$P_i = \frac{N_i}{\sum_{i=1}^{n} N_i} \qquad (3)$$


In Eq. 3, $N_i$ is the weight assigned to the i-th review element in the review,
$\sum_{i=1}^{n} N_i$ is the sum of the weight values of all the review elements found in
the review, and $P_i$ is the proportion of the total weight carried by the i-th review
element in the review.
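
As a minimal sketch of Eq. 3, the following Python snippet maps the words of a review to
review elements using a few of the related words from Table 2 and then normalizes the
element weights. The numeric weights in `ELEMENT_WEIGHTS` are hypothetical placeholders
chosen only for illustration (they are not the values of Table 1), and the function
names are likewise assumptions.

```python
# Hypothetical weights N for each review element (placeholders, not Table 1 values).
ELEMENT_WEIGHTS = {"Product": 4, "Price": 3, "Logistics": 2, "Pack": 1}

# A few related words per element, taken from Table 2.
RELATED_WORDS = {
    "Product": {"product", "taste", "delicious"},
    "Price": {"cheap", "affordable", "cost-effective", "expensive"},
    "Logistics": {"speed", "express"},
    "Pack": {"packaging", "grade", "exquisite"},
}


def find_elements(words):
    """Return the review elements whose related words occur in the review."""
    found = []
    for element, related in RELATED_WORDS.items():
        if any(word.lower() in related for word in words):
            found.append(element)
    return found


def weight_proportions(elements):
    """Eq. 3: P_i = N_i / sum of N over the elements found in the review."""
    total = sum(ELEMENT_WEIGHTS[e] for e in elements)
    if total == 0:
        return {}
    return {e: ELEMENT_WEIGHTS[e] / total for e in elements}


review = ["delicious", "but", "expensive", "packaging"]
elements = find_elements(review)
proportions = weight_proportions(elements)
print(elements)
print(proportions)
```

With these placeholder weights the three matched elements share the total weight of 8,
so the proportions always sum to one regardless of which elements a review mentions.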

The affective factors in a review mainly include emotional words, negation words and
turning (adversative) words. The emotional word annotation is based on the Chinese
Emotional Vocabulary Ontology Library [6] and extended according to the reviews; it is
shown in Table 3.


Table 3. Emotional poles vocabulary table 


Emotion tendency | Vocabulary                       | Marking
Neutral          | In general, okay, gray           | 0
Commendatory     | Good, like, authentic, beautiful | 1
Derogatory       | Garbage, silent, bad, poor       | 2


In judging the emotional tendency of a review document, the result is often determined
not only by the polarity category of the emotional vocabulary in the text but also by
the words used together with that vocabulary [7, 8]. The negation vocabulary is
therefore listed in Table 4 and the turning vocabulary in Table 5.

Table 4. Negative vocabulary table


Whether it contains negative words? | Vocabulary                                    | Marking
Yes                                 | No, not, inferior to, incompatible, will not  | 1
No                                  | ∅                                             | 0





Table 5. Turning vocabulary table 


Whether it contains turning words? | Vocabulary                                   | Marking
Yes                                | But, well, however, but, yet, still, just    | 1
No                                 | ∅                                            | 0


A commodity review sentence usually does not evaluate every review element, but it often
touches on two or more concerns, each positive or negative. If all of them share the
same emotional tendency, the emotional polarity of the sentence is easy to judge. If two
kinds of emotional tendency are present, the general approach is to weight the elements
with a simple weighting scheme, but this tends to introduce errors into the results [9].
Therefore, in the weighted calculation of the emotional analysis, when a turning word
precedes a review element, the weight of the previous clause is weakened or the weight
of the element itself is strengthened. When there are three or more review elements,
weakening the previous terms increases the number of calculations. Combining the review
elements and weights of Table 1, Eq. 4 is used to calculate the influence of the turning
word on the weights of the review elements.





$$P_i = \frac{N_i}{\sum_{i=1}^{n} N_i}(1 + \theta), \qquad \theta = 1 - \frac{N_i}{\sum_{i=1}^{n} N_i}, \quad \theta \in (0, 1) \qquad (4)$$





3.2.2 Naive Bayesian Classification 

From Eq. 2, the Naive Bayesian classification algorithm only needs to find the class
$C_i$ that maximizes $P(X \mid C_i)P(C_i)$ in order to predict the emotional tendency of
the data to be classified. The tuple to be classified is called X, and the attribute
values of X are obtained by text preprocessing and by feature and weight selection.
$P(C_i)$ is the prior probability of class $C_i$, and $P(X \mid C_i)$ is the conditional
probability of vector X under class $C_i$. Both $P(C_i)$ and $P(X \mid C_i)$ are
initially unknown and are estimated from the training set, so the Naive Bayesian
algorithm consists of calculating the prior probabilities and the conditional
probabilities. Fig. 2 shows the algorithm flow.








[Flowchart: calculate the prior probability of each class C_i; calculate the conditional
probability of each attribute under class C_i; return the label of the class C_i with
the largest probability value]


Fig. 2. Bayesian classification algorithm flow


After the prior probability has been calculated, the probability $P(X \mid C_i)$ of the
tuple X in class $C_i$ is calculated. Under the Naive Bayesian assumption, the
attributes of X are independent of each other, so $P(X \mid C_i)$ is obtained by
multiplying the conditional probabilities of each attribute value under class $C_i$.

Applying the algorithm to each review category gives the value of $P(X \mid C_i)$ for
every category. By comparing the values of $P(X \mid C_i)P(C_i)$, the class label of the
current tuple X can be predicted. When a review text has multiple review elements, the
emotional tendency of the whole sentence is determined by comparing the weighted values
$P(X \mid C_i)P(C_i) \cdot w_i$, where $w_i$ represents the weight value.
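
The classification rule above can be sketched in a few lines of Python. This is an
illustrative toy, not the system's implementation: the training sentences are made up,
the labels follow the marking of Table 3 (0 neutral, 1 commendatory, 2 derogatory), and
Laplace (add-one) smoothing and log-probabilities are assumptions added here so that
unseen words and long products do not produce zero or underflowing probabilities.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Estimate P(C_i) and per-class word counts from (words, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in samples:
        word_counts[label].update(words)
        vocabulary.update(words)
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    return priors, word_counts, vocabulary


def classify(words, priors, word_counts, vocabulary):
    """Return the class C_i that maximizes log P(C_i) + sum of log P(x_k | C_i)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in words:
            # Add-one smoothing (an assumption, not stated in the text).
            count = word_counts[label][word] + 1
            score += math.log(count / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data; labels: 0 neutral, 1 commendatory, 2 derogatory.
training = [
    (["good", "taste", "like"], 1),
    (["beautiful", "packaging", "good"], 1),
    (["bad", "taste", "poor"], 2),
    (["garbage", "poor", "service"], 2),
    (["okay", "general"], 0),
]
model = train(training)
print(classify(["good", "packaging"], *model))
print(classify(["poor", "service"], *model))
```

The element weight $w_i$ is left out of this sketch; it could be applied by adding
`math.log(w_i)` to the score of the clause that mentions the i-th review element.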


3.3 MapReduce Parallel Computing Module 


The MapReduce parallel computing module integrates the calculation of the classification
stage into the Hadoop platform in order to process emotional classification in parallel.
In MapReduce computing on a Hadoop cluster, the Map phase generally does the main data
processing work, while the Reduce phase is primarily responsible for reduction
operations such as counting or taking a maximum. The training set used to construct the
naive Bayesian emotional classifier is manually annotated and the amount of data is not
large; a massive training set might also cause over-fitting. Therefore, both the
construction of the classifier and the classification stage are processed in the Map
phase, and the classification statistics are processed in the Reduce phase. The
MapReduce parallel computing flow is shown in Fig. 3.
Fig. 3. MapReduce parallel computing flow
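
The division of work between the two phases can be illustrated with a small in-memory
simulation in Python. This is only a sketch of the map, shuffle and reduce steps, not
Hadoop code: `toy_classify` is a hypothetical stand-in for the trained naive Bayesian
classifier, and the record layout (brand code, review text) is an assumption.

```python
from collections import defaultdict

# Word lists follow Table 3; the rule itself is a hypothetical stand-in classifier.
COMMENDATORY = {"good", "like", "authentic", "beautiful"}
DEROGATORY = {"garbage", "bad", "poor"}


def toy_classify(text):
    """Return 1 (commendatory), 2 (derogatory) or 0 (neutral) for a review."""
    words = set(text.lower().split())
    if words & COMMENDATORY:
        return 1
    if words & DEROGATORY:
        return 2
    return 0


def map_phase(record):
    """Map: classify one review and emit ((brand, label), 1)."""
    brand, text = record
    yield (brand, toy_classify(text)), 1


def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one (brand, label) key."""
    return key, sum(values)


def run_job(records):
    """Run map, group the intermediate pairs by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


reviews = [
    ("0006", "good taste and beautiful box"),
    ("0006", "poor service"),
    ("0006", "like it"),
    ("0006", "okay in general"),
]
counts = run_job(reviews)
print(counts)
```

Because each review is classified independently in the map step, the reviews can be
split across any number of mappers; only the small per-key counts reach the reducer.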











4 Implementation of the Review Text Analysis System Based on Hadoop


4.1 Implementation and Test of the Review Text Analysis System


4.1.1 Data Selection and Evaluation Standard of Experiment 
The Hadoop distributed cluster has six servers in total, which run as virtual machines
built on two desktop computers with VMware software.

The experimental data set consists of commodity review texts for the chocolate brands,
collected from different platforms. The training set and the test set are constructed by
manual annotation. Before annotation, the valid data is selected, and the data set is
then annotated according to the lexical ontology database that has been constructed.
Finally, a data set of one hundred commendatory comments, fifty derogatory comments and
thirty neutral comments was constructed and divided at a ratio of 1:1, with ninety
comments assigned to each of the training set and the test set.

To evaluate the effectiveness of the classification of commodity reviews, the following
indicators are often used: the precision, recorded as P, and the recall rate, denoted by
R. The calculation of these two indicators is described below (Table 6).


Table 6. Product review test set classification example table 


Category                                          | The number of reviews of “good reviews” | The number of reviews of “bad reviews”
The number of reviews judged to be “good reviews” | a                                       | b
The number of reviews judged to be “bad reviews”  | c                                       | d


The precision P is the proportion of correctly classified reviews a among all the
reviews a + b that the classifier judged to be “good reviews”. The precision P is
calculated as shown in Eq. 5:


$$P = \frac{a}{a + b} \qquad (5)$$


The recall rate R is the proportion of correctly classified reviews a among all the
reviews a + c in the test set that actually are “good reviews”. The recall rate R is
calculated as shown in Eq. 6:


$$R = \frac{a}{a + c} \qquad (6)$$
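
Eqs. 5 and 6 translate directly into code. The short Python example below uses
hypothetical counts a, b and c chosen only to illustrate the arithmetic; they are not
the results of the experiment.

```python
def precision(a, b):
    """Eq. 5: share of the reviews judged "good" that really are good."""
    return a / (a + b)


def recall(a, c):
    """Eq. 6: share of the truly good reviews that were judged "good"."""
    return a / (a + c)


# Hypothetical counts for illustration only (not the paper's results).
a, b, c = 40, 10, 5
print(round(precision(a, b), 3))
print(round(recall(a, c), 3))
```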


4.1.2 Design and Result Analysis of Experiment 

To verify the classification accuracy of the emotion classifier and its classification
efficiency on the Hadoop platform, the following experiment was designed: the ninety
items of the test set were taken as the object of study, and their emotion
classification was analyzed to verify the accuracy of the classification. The experiment
runs the program on the Hadoop platform to classify the review texts in the test set.
The experimental classification result is shown in Fig. 4.


0006 e0003 2015-06-08 00005 t0001 6 0 0
0006 e0003 2015-06-08 00006 t0001 18 0 0
0006 e0003 2015-06-08 00007 t0001 12 0 0
0006 e0003 2015-06-08 00010 t0001 36 0 0


Fig. 4. Experimental emotional classification results


In Fig. 4, the columns from left to right are the product brand code, the e-commerce
platform code, the date, the review category code, the commodity category code, and the
counts of good, medium and bad reviews. The precision P of the experiment is 0.862 and
the recall rate R is 0.88. The experimental results show that the accuracy rate of the
classifier is more than eighty-five percent, which means the system is good at
predicting the emotional tendency of the review texts.

Analysis of the output shows that the classification is more accurate when the review
text has only one review element or contains only emotional words. When the review text
has multiple review elements, classification errors are more likely, which lowers the
classification accuracy.


5 Conclusion 


Against the background of widespread online shopping and the big data age, e-commerce
reviews are chosen as the object of study. The features of the review texts are
analyzed, and an information retrieval system based on Hadoop data processing technology
is designed. The system extracts and analyzes additional business information from a
large amount of unstructured review text, providing an important reference for merchants
adjusting their production and sales strategies. Although the corresponding research
work is complete and gives good analysis results, there are some deficiencies, and the
following aspects need to be improved:


(1) Only the naive Bayesian classification algorithm is adopted. Although it classifies
    the commodity review texts well, it cannot capture all the attribute characteristics
    of a text when the text is large.

(2) The Hadoop platform is good at dealing with massive data, but because of limited
    hardware the cluster was built only on virtual machines, so its efficiency on
    large-scale data could not be tested effectively.

Abstract. Network attacks can destroy the connectivity of the resource network topology
composed of routers, switches and other resources. This type of structural
vulnerability, which is aggravated by the increasing number of nodes in the network, is
a hotspot of current research. In order to integrate the random attack and the targeted
attack, we propose an asymmetric information attack model which is closer to reality. In
this attack model, we use the attack range and the node detection degree to adjust the
attack mode; these two parameters extend the attack mode beyond the random attack and
the targeted attack. In this paper, we apply our attack model to a BA network, an ER
network and a Router network under different parameters. We find that in the ER network
the random attack is more effective than attack modes with nonzero node detection
degree, while the BA network is fragile under attack modes with nonzero node detection
degree. In addition, although the degree distributions of the Router network and the BA
network both satisfy a power law, they show different structural vulnerability: in the
Router network the random attack is more effective than the asymmetric information
attack with nonzero node detection degree and attack range. The Router network has the
same structural vulnerability as the ER network, which means the Router network also has
randomness.


Keywords: Structural vulnerability · Asymmetric information · Random attack · Targeted attacks


1 Introduction 


With the rapid development of the information society, information systems composed of
distributed network equipment, computers, databases and application software have
attracted particular interest. They share resources faster and coordinate better, but
their structure may be complex. To analyze the influence of the resource topology on
communication, we need to detect the basic network topology. A computer terminal can
send data packets to any destination, and by varying the time-to-live (TTL) of the
packets we can find the IP addresses of the routers along the path. This process is
implemented by the tool “Traceroute” [1]. We can then combine the traced routing paths
of a large number of nodes to detect the whole network topology [2]. In addition, we can
refer to the routing table information stored in routers, that is, each router’s
connections to the other routers in the network, to complete the network topology.
Rather than studying routing in the network, we pay more attention to the connectivity
and other properties of the network. Present research on complex information systems
mostly concentrates on the connectivity of the network.

Ideally, the network topology is connected. However, the number of network attacks is
rising, including Trojans, botnets, computer viruses, worms, denial-of-service attacks,
web page tampering, domain name hijacking and so on. These can cause nodes or links to
fail. Removing a single node or link can damage the originally connected network and
result in communication failure. A network is robust if it still has strong connectivity
after several nodes or links are deleted. Conversely, if the network is no longer
connected because a large number of isolated nodes have emerged, we consider it
vulnerable. The vulnerability caused by changes in the network topology is called
structural vulnerability. Considering that node failures and link failures affect the
network equally, in this paper we discuss only node failures. There are two main types
of existing attack: the random attack [3, 4] and the targeted attack [5, 6]. The random
attack selects nodes in the network at random. The targeted attack ranks the nodes in
the network by centrality and deletes the most important ones. Many studies [7-9] have
compared the two attack methods, but they did not point out the relationship between
them. We think that these two forms of attack can be combined in a common attack model,
and how to construct such a model is our research focus.

We also reviewed the research on structural vulnerability. Albert [10] found that in a
random network, when the number of removed nodes exceeds a threshold, the network
becomes fragmented and loses its connectivity. In scale-free networks the threshold
phenomenon disappears, and the number of nodes in the largest connected subgraph
decreases in proportion to the number of deleted nodes. Li [11] proposed an attack
method based on the Maximal Vertex Coverage (MVC) and conducted many simulation
experiments to show that different networks have different structural vulnerability
under the MVC attack. Li [12] used percolation theory to analyze network reliability.
The failure of a network can be regarded as a percolation process, in which working
nodes correspond to occupied nodes and failed nodes correspond to blank nodes. Through
network simulation he obtained the lifetime of the network nodes and found that the
larger the number of nodes, the longer the life of the network. Ye [13] reconstructed
the network into relatively small scale-free networks in order to increase its
robustness. Tanizawa [14] described random attacks and targeted attacks as a series of
waves and proposed a series of schemes for optimizing resistance to destruction.
However, most research focuses on comparing network structural vulnerability under a
specific attack mode and assumes that the network is fully known. It ignores the fact
that the attacker may not obtain complete information about the network, as well as the
differences between defenders and attackers. In an actual attack, the detected network
differs from the real network because of factors such as technology and resources. This
reflects the information asymmetry between attackers and defenders. To understand
structural vulnerability better, the simulation results need to be verified
theoretically, but there are only a few theoretical studies of network structural
vulnerability. Karrer [15] completed the theoretical derivation of percolation theory in
sparse networks. Simulating the percolation process amounts to simulating information
transmission, which provides a theoretical basis for the spread of disease and for
network attacks. Newman [16] analyzed the largest connected component in random networks
and showed that it represents the connectivity of the network. Di [17] combined
assortativity with percolation theory and argued that increasing the independence of the
nodes in the network can effectively increase its robustness. This theoretical research,
however, does not completely fit the simulation results, and it is sometimes infeasible
for large-scale networks.

Accordingly, in view of the information differences between attacker and defender, we
put forward an asymmetric information attack model in this article. Based on the
proposed model, we run simulations on different networks and find that BA networks are
vulnerable under asymmetric information attacks with nonzero node detection degree.
Simulations on the Router network show that, although the degree distributions of the BA
network and the Router network both follow a power law, the Router network shows a
structural vulnerability different from that of the BA network and the same as that of
the ER network. This suggests that as the number of nodes increases, a real network
shows some randomness.

The rest of the paper is arranged as follows. Section 2 introduces the random attack and
the targeted attack and puts forward the asymmetric information attack model to model
network attacks. Section 3 uses the proposed attack model to measure network structural
vulnerability. By attacking the BA network we observe the correlation between the
coefficients in the model and the different performance under different attack forms.
After that, we verify that the Router network and the BA network both follow a power law
distribution, run simulations on the Router network and compare the results with those
of the other networks. Section 4 summarizes the research achievements of this paper.


2 Attack Mode 


The targets of a network attack are generally the topology of network resources such as
routers and switches, which can be abstracted as the graph G(N, E). Here N represents
the nodes in the network and E represents the edges. Let the adjacency matrix A stand
for the connectivity of the network. A is an N × N matrix: if node i and node j are
connected, $a_{ij} = 1$, otherwise $a_{ij} = 0$. When nodes are attacked because of a
system vulnerability or for other reasons and fail in the topology of controlled
resources, a disturbance occurs and the adjacency matrix A changes into
$A_{N-k} = A - A_k$, where k is the proportion of removed nodes. Analyzing structural
vulnerability is equivalent to analyzing the difference in connectivity between A and
$A_{N-k}$. We introduce R as the index for evaluating attack performance; R has many
different definitions. In this paper, we choose the largest connected branch as the main
measure and use the local redundancy of a specific node as a supplement. Concrete
expressions are given in Sect. 3.

2.1 Random Attack and Targeted Attack


When a network is under a random attack, nodes are deleted and the value of R changes.
To compare different indexes side by side, we normalize R to $R^*$. The boundary values
of $R^*$ can be defined as follows.


$$R^*_{\min}[k] = \left\{\min[R^*(\tfrac{1}{N})], \min[R^*(\tfrac{2}{N})], \ldots, \min[R^*(1)]\right\} \qquad (1)$$
$$R^*_{\max}[k] = \left\{\max[R^*(\tfrac{1}{N})], \max[R^*(\tfrac{2}{N})], \ldots, \max[R^*(1)]\right\} \qquad (2)$$


where k in $R^*[k]$ represents the proportion of removed nodes. The mean values are
denoted as


$$R^*_{E}[k] = \left\{E[R^*(\tfrac{1}{N})], E[R^*(\tfrac{2}{N})], \ldots, E[R^*(1)]\right\} \qquad (3)$$


Finding a node set that minimizes $R^*[k]$ is an NP-hard problem [18], and the random
attack fits the mean value of $R^*[k]$. For a specific network topology, when the
attacked nodes are chosen randomly, the probability of each node being selected can be
denoted as


$$P\left\{\frac{1}{N}, \frac{1}{N}, \frac{1}{N}, \ldots, \frac{1}{N}\right\} \qquad (4)$$


This kind of attack has low algorithmic complexity and can finish the selection process
quickly, but its effect is not ideal in some networks, as the following simulations will
show.

The targeted attack deletes nodes according to node centrality, which means that the
attacker has to detect the whole network first. This attack method has high algorithmic
complexity because every node in the network must be compared. Node centrality [19, 20]
includes degree centrality, closeness centrality, betweenness centrality and so on. In
this paper, we apply degree centrality; that is, a node with a higher degree has a
higher priority for deletion. The targeted attack has different effects in different
networks. The first step is to sort the nodes according to their degree in the detected
network.


$$\mathrm{Centrality}\{n_1, n_2, \ldots, n_N\}, \quad n_1 \ge n_2 \ge \cdots \ge n_N \qquad (5)$$


At this point, the probability distribution of being chosen over all the nodes in the
network is P{1, 0, 0, ..., 0}. The targeted attack selects only one node as the target
at a time, which ensures the accuracy of the attack. We can assume that the targeted
attack fits the minimum value of R. Simulations and further explanation are given in
Sect. 3.
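
The two attack modes can be compared on a toy graph with a short Python sketch. It is
written with the standard library only; the adjacency-dict representation, the function
names and the star-shaped example graph are assumptions made for illustration, and the
degree ranking is computed once before deletion rather than after every removal.

```python
import random
from collections import deque


def largest_component_size(adj):
    """Size of the largest connected component of an adjacency dict."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def remove_nodes(adj, targets):
    """Return a copy of the graph with the target nodes and their edges removed."""
    targets = set(targets)
    return {n: nbrs - targets for n, nbrs in adj.items() if n not in targets}


def random_attack(adj, count, rng):
    """Each node is chosen with equal probability 1/N (Eq. 4)."""
    return remove_nodes(adj, rng.sample(sorted(adj), count))


def targeted_attack(adj, count):
    """Delete the nodes with the highest degree first (Eq. 5)."""
    ranked = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    return remove_nodes(adj, ranked[:count])


# Toy graph: a hub (node 0) joined to nine leaves.
star = {0: set(range(1, 10))}
star.update({leaf: {0} for leaf in range(1, 10)})

print(largest_component_size(star))
print(largest_component_size(targeted_attack(star, 1)))
print(largest_component_size(random_attack(star, 1, random.Random(1))))
```

On this graph the targeted attack always removes the hub and leaves only isolated nodes,
whereas the random attack usually removes a leaf and leaves the network almost intact.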

2.2 Asymmetric Information Attack


During a network attack, the attacker cannot obtain the entire topology information of
the network. The amount of information obtained depends on the attacker’s resources,
technology and so on. The adjacency matrix detected by the attacker, $A'$, must contain
a perturbation, denoted as $A' = A + \tilde{A}$, where $\tilde{A}$ is the error caused
by the limits of the attacker’s technology or resources. The attacker evaluates node
importance and deletes nodes based on the information obtained, and the effect on the
detected network is then reflected in the real network. This whole process is called the
asymmetric information attack: the attacker cannot obtain complete information about the
network, and the defender cannot know the attacker’s choices or what the attacker knows
about the network. We therefore need to model this process.

To attack a network G(N, E) with nodes $N\{n_1, n_2, n_3, \ldots, n_N\}$, we sort the
nodes by centrality to determine the attack performance. In this paper, we use degree
centrality as the importance of nodes. We assume that the attacker is rational and has
restricted resources; that is, in attacking a specific network, the attacker cannot know
the entire topology of the network and will choose a certain number of nodes to delete.
Based on that, we put forward the following two parameters.


(1) Attack range H. First we calculate the degrees of all nodes and sort them. H
    represents the proportion of detected nodes. We define H = i/n, where i is the
    number of detected nodes and n is the number of all nodes in the network.

(2) Detection degree of node F. When a network is attacked, A and A’ differ, so there
    must be an error between them. We express the detection degree of node F as a
    percentage: F is the ratio of the detected network information to the whole network
    topology information, ranging from 0 to 1.


With restricted resources, F and H must lie between 0 and 1. Obtaining higher F and H
clearly requires more resources, so we define the resource cost coefficient as follows.


$$C = H \times F, \quad C \in [0, 1] \qquad (6)$$


A higher C means a more complex algorithm and a higher time cost. For the random attack
C = 0, and for the targeted attack C = 1. Assume that the attacker spends the same
resource G on each deleted node, and let x be the number of deleted nodes. The optimal
attack plan must then meet


$$P\{\mathrm{Max}[\Delta Con] \mid \mathrm{Min}[G(x - i) + R(i)]\} \qquad (7)$$

where $R(i) = G \cdot (i + C \times i)$ is the resource spent under the asymmetric
information attack and $\Delta Con$ is the effect on the connectivity of the network.

2.3 Attack Model 


From the definition of the asymmetric information attack, we can see that it includes
both the random attack and the targeted attack. As shown in Fig. 1, light red nodes
represent the undetected nodes and red nodes are the detected nodes with high degree
under a certain F. When F is not large enough, the attacker cannot find all the nodes
with high degree. The lack of information acquired by the attacker leads to inaccurate
attacks.


[Figure: three panels ordered by detection degree of node: F = 0 (random attack),
F between 0 and 1, and F = 1 (targeted attack)]


Fig. 1. Asymmetric information attack graph (Color figure online)


Table 1 shows the relationship between the parameters and the attack mode. When F = 0,
the attack mode is the random attack, and H becomes meaningless. When F = 1, the mode
turns into the targeted attack, and H is in effect the number of deleted nodes. When
F ∈ (0, 1), it is the asymmetric information attack: the probability of selecting
important nodes depends on F, and the attack range H corresponds to the number of
selected nodes. A larger H can compensate for a smaller F. This phenomenon is shown and
analyzed in the later simulations.


Table 1. The mapping relationship between parameter and attack mode 





F      | H | Attack mode
0      | ∅ | Random attack
1      |   | Targeted attack
(0, 1) |   | Asymmetric information attack


From the attacker’s resources we obtain H and F and the number of deleted nodes x. To
prevent nodes with high degree from appearing repeatedly in the sample, we model the
whole simulation process as unequal-probability sampling without replacement. The whole
attack process is defined as follows:


Step 1: Based on H, calculate the number of sample nodes i, which meets i = H × N.

Step 2: Calculate the degree of every node in the network and sort the nodes.

Step 3: Assign a sample rate to every node in the network. For the node with the highest
        degree, the sample rate is F. The sample rates of the other nodes are calculated
        as follows:

$$N_i = (1 - F)^{i} F \qquad (8)$$


Step 4: Select one node according to the sample rates. Then compare the number of sample
        nodes with i. If it is smaller, go to Step 2; otherwise go on.

Step 5: When x < i, just delete x of the sample nodes. When x > i, first delete the
        sample nodes, then delete x - i nodes randomly.
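
The steps above can be sketched in Python as follows. This is one possible reading of
the procedure rather than the authors' code: the node of rank k (k = 0 for the highest
degree) is given the weight (1 - F)^k F, the remaining nodes are re-ranked after every
draw so that sampling is without replacement, F = 0 is treated as a uniform draw, and
when x < i the first x sampled nodes are deleted. All function and variable names are
hypothetical.

```python
import random


def sample_by_rank(ranked, F, size, rng):
    """Draw `size` nodes without replacement; rank k gets weight (1 - F)**k * F."""
    remaining, chosen = list(ranked), []
    while remaining and len(chosen) < size:
        if F == 0:
            # No topology knowledge: every remaining node is equally likely.
            index = rng.randrange(len(remaining))
        else:
            weights = [(1 - F) ** k * F for k in range(len(remaining))]
            index = rng.choices(range(len(remaining)), weights=weights)[0]
        chosen.append(remaining.pop(index))
    return chosen


def asymmetric_attack(degrees, H, F, x, rng):
    """Return the x nodes to delete, following Steps 1 to 5."""
    # Step 1: sample size i = H * N.
    i = round(H * len(degrees))
    # Step 2: sort nodes by degree, highest first (ties broken by node id).
    ranked = sorted(degrees, key=lambda n: (-degrees[n], n))
    # Steps 3 and 4: unequal-probability sampling without replacement.
    sample = sample_by_rank(ranked, F, i, rng)
    # Step 5: delete from the sample, topping up at random if x > i.
    if x <= len(sample):
        return sample[:x]
    sampled = set(sample)
    others = [n for n in ranked if n not in sampled]
    return sample + rng.sample(others, min(x - len(sample), len(others)))


# Hypothetical degree table: node id -> degree.
degrees = dict(enumerate([9, 7, 5, 3, 3, 2, 2, 1, 1, 1]))
rng = random.Random(0)
# With F = 1 the attack is fully targeted: the three highest-degree nodes are chosen.
print(asymmetric_attack(degrees, H=0.3, F=1.0, x=3, rng=rng))
# With 0 < F < 1 and x > i, three sampled nodes are topped up with two random ones.
print(asymmetric_attack(degrees, H=0.3, F=0.5, x=5, rng=rng))
```

Setting F = 1 reproduces the targeted attack and F = 0 the random attack, which matches
the mapping between parameters and attack modes described for the model.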


In the real world, when an attacker launches an attack against a network, the attacker
cannot obtain complete information about the network, and the defender cannot know the
attacker’s choices or what the attacker knows about the network. Supposing that the
attacker is rational and has limited resources, the best choice is to attack the most
important nodes in the network. The importance of a node depends on two aspects: the
attacker’s knowledge of the network and the index of importance. For the first aspect,
we proposed two parameters, the attack range H and the detection degree of node F, which
represent the proportion of detected nodes and the ratio of the detected network
information to the whole network topology information respectively. Put simply, H shows
how many nodes have been detected and F shows how much the attacker knows about the
detected nodes. For the second aspect, the indicator of importance is usually the
centrality of the nodes. In this paper we use the degree: a node with a higher degree is
connected to more nodes, so attacking it has a greater effect. Thus, the higher the
degree of a node, the more important the node is.


3 Structural Vulnerability Measurement and Impact Analysis 


3.1 Attack Performance Evaluation Indicators 


R measures the connectivity of the network. A common measure of connectivity is the
number of nodes that belong to the largest connected component. The redundancy of the
network is also used to measure the network, especially its robustness. R has two
definitions, explained as follows. $R_m$ is the number of nodes in the largest connected
branch:


$$R_m = \{n_i \mid \text{iff } \exists\, n_j \in R_m \wedge e_{ij} = 1\} \qquad (9)$$


The largest connected branch is the carrier of the network traffic. The number of nodes
in the largest connected branch measures the connectivity of the whole network after an
attack, and is also an important index for judging whether the network has failed.

R, is the local redundancy of the specific node in the network. It can show the 
robustness of the node against the attack. It is also a measurement of connectivity. It is 
defined as follow. 


R,=— DY min(do),d() (10) 


ST qe) 
Node redundancy rate refers to the ratio of the number of paths through which a node
reaches its “neighbors’ neighbors” to the number of such paths in the complete graph.
$\Gamma_i$ represents the set of neighbor nodes, $\Gamma_i^2$ is the set of the node’s
“neighbors’ neighbors”, and |S| is the complete graph consisting of the node,
$\Gamma_i$ and $\Gamma_i^2$.


3.2 Structural Vulnerability Analysis on BA Network 


To test the performance of the network under various attack modes, we use a BA network
for simulations. The BA network is a sparse network with 5000 nodes and 5000 edges. We run
the simulations in Python, using Spyder as the development environment and the networkx
module to generate BA networks. In the experiments we assume that the attacker intends to
delete 100 nodes. We then adjust the values of H and F, observe how the network parameters
change, and use the simulation results to analyze the properties of network vulnerability
under different H and F. To ensure accuracy, we repeat each experiment 50 times, average
the results, and plot the resulting curves.
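The experimental loop described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the Python standard library instead of networkx, grows the graph with a simple preferential-attachment rule (one link per new node gives about as many edges as nodes), and compares only the two special cases named in the paper, the random attack and the degree-targeted attack. The roles of H and F are not modelled, the degree ranking is computed once on the intact graph, and all function names are hypothetical.

```python
import random


def ba_like_graph(n, m, rng):
    """Grow a connected graph by preferential attachment, m links per new node."""
    adj = {v: set() for v in range(n)}
    endpoints = []  # every node appears once per incident edge

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)
        endpoints.extend((a, b))

    for v in range(1, m + 1):           # seed: node 0 joined to nodes 1..m
        link(0, v)
    for v in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:          # sampling endpoints is degree-proportional
            chosen.add(rng.choice(endpoints))
        for target in chosen:
            link(v, target)
    return adj


def largest_component_size(adj, removed):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def attack_curve(adj, order):
    """Largest-component size after each successive deletion in `order`."""
    removed, curve = set(), []
    for node in order:
        removed.add(node)
        curve.append(largest_component_size(adj, removed))
    return curve


def average_curve(n, m, deletions, runs, targeted, seed=0):
    """Average the attack curve over several independently generated graphs."""
    rng = random.Random(seed)
    totals = [0.0] * deletions
    for _ in range(runs):
        adj = ba_like_graph(n, m, rng)
        if targeted:   # highest initial degree first
            order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:deletions]
        else:          # uniformly random victims
            order = rng.sample(sorted(adj), deletions)
        for k, size in enumerate(attack_curve(adj, order)):
            totals[k] += size
    return [t / runs for t in totals]


# Small values keep the sketch fast; the setting in the text would be
# n=5000, deletions=100, runs=50.
print(average_curve(500, 1, 20, 5, targeted=False)[-1])
print(average_curve(500, 1, 20, 5, targeted=True)[-1])
```

Averaging over several generated graphs mirrors the repeated experiments in the text; the per-step averages can then be plotted against the number of deleted nodes.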


Table 2. Asymmetric information attack algorithm
Input: {G, N, H, F, x}
Output: {G', P(p_0, p_1, p_2, ..., p_i), Figure}
1: input {G, N, H, F, x}
2: edges(1), nodes(k), linklist
3: calculate i
4: sorted degree N_s = [n_1, n_2, n_3, ..., n_i]
4: P_k = [p_1, p_2, p_3, ..., p_i]
5: use distr.rvs to choose n_i according to P_k
6: if n < i go to 4, else go to the next step
7: delete n_i in order and get p
8: show Figure
9: return {G', P(p_0, p_1, p_2, ..., p_i), Figure}
Table 2 gives the algorithm that uses the number of nodes belonging to the largest
connected component to measure the connectivity of the network after it has been attacked.

Fig. 2. The change of R,, under different H and F in BA network

First we simulate the attack on the BA network (Fig. 2). The attack range H is 0.001, 0.002
and 0.005, respectively, and the detection degree F of a node is 0.1, 0.5 and 0.9,
respectively. In the figures, the horizontal axis represents the number of deleted nodes
and the vertical axis the number of nodes in the largest connected branch. The simulation
results show the following:


(1) As a whole, whatever strategy the attack follows, the curve flattens after 100 nodes
    are deleted and the network shows its robustness. That is, the BA network is not
    completely broken after a certain number of nodes are deleted. However, at this point
    the network may no longer be able to communicate. Therefore, the attacker does not
    need to delete all 100 nodes to destroy the network.

(2) Comparing 2-a, 2-b and 2-c, we find that when H and F are large enough, the curves
    drop rapidly and then reach a plateau. At this point the network has failed. When
    H = 0.005 and F = 0.9, deleting 20 nodes is enough to make the network fail. Clearly
    these 20 nodes are important to the connectivity of the whole network. The attacker is
    a rational agent with limited resources and will therefore choose the most important
    nodes to attack.

(3) The supplementary effect of H and F can also be explained from the algorithm. From
    2-a, 2-b and 2-c, when H is fixed, F only needs to reach a certain value to make the
    network fail quickly: in the figures, the curves for F = 0.5 and F = 0.9 are close to
    each other. From 2-d, 2-e and 2-f, when F is fixed, the curves likewise come close to
    each other once H reaches a certain value. We use the resource cost coefficient C to
    evaluate the attack performance. When C is smaller, the attack cost is lower and,
    correspondingly, so is the protection cost. How to balance cost against effect is an
    important issue. We can also see that the BA network is a multi-center network: after
    several center nodes are deleted, the network tends to break apart.


3.3 Compared with Other Indicators 


Figure 3 shows the change in redundancy of the specific node (the same node as in
Sect. 3.2). The node’s local redundancy falls considerably, which indicates that the
attacker deletes many neighbors or “neighbors’ neighbors” of this node. Eventually all
curves flatten, which shows a certain robustness.
Fig. 3. The change of redundancy in BA network

Comparing R,, with R,, we find that both indicators can be used to measure the
connectivity of the network and both reflect its robustness. The difference is that they
describe the network from different aspects: R, is used to analyze the connectivity of a
single node, and R,, is used to measure the connectivity of the whole network. The
connectivity of a single node and that of the whole network are different but related.
When important nodes need to be evaluated, we select R, as the indicator. When the
robustness of the network structure needs to be evaluated, we select R,,. The analysis of
the BA network shows that the nonrandom attack (F > 0, H > 0) is more effective than the
random attack (F = 0, H = 0), and the larger F and H are, the more pronounced the attack
effect. At the beginning of the attack, the connectivity of the network drops quickly.
Once the network shows its robustness and the curves flatten, the effects of the random
attack and the nonrandom attack (F > 0, H > 0) tend to be the same.


3.4 Simulations on Router Network 


The preceding analysis gives the structural vulnerability of the BA network. Comparing a
real network with the BA network provides more information about structural
vulnerability. The Router network has 192244 nodes and 609066 edges.


Fig. 4. Simulations on Router network. (a) The degree distribution of Router network.
(b) The change of R,, during the attack on Router network.


Figure 4a shows the degree distribution of the Router network, where the x axis is the
degree and the y axis is the number of nodes. Degree and number of nodes follow a linear
relationship in this plot, so the degree distributions of both the Router network and the
BA network obey a power law. Fig. 4b shows that the network fails after 1000 nodes are
deleted. The top curve is the asymmetric information attack with H = 0.0003 and F = 0.9,
and the random attack (F = 0, H = 0) lies below it, so the random attack is more effective
than the asymmetric information attack. We conclude that the effect of the asymmetric
information attack on the Router network is opposite to that on the BA network, even
though the degree distributions of both networks obey a power law. Structural
vulnerability therefore does not depend only on the degree distribution. To understand
the reason for this phenomenon, we run simulations on an ER network, whose structural
vulnerability differs from that of the BA network.
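The power-law observation can be checked roughly by fitting a straight line to the degree histogram on logarithmic axes. The sketch below assumes that the linear relationship in Fig. 4a refers to log-log axes, which the text does not state explicitly; the function names are hypothetical, and a least-squares fit of this kind is only a coarse diagnostic rather than a rigorous test.

```python
import math
from collections import Counter


def degree_histogram(adj):
    """Map each degree value to the number of nodes that have it."""
    return Counter(len(neigh) for neigh in adj.values())


def loglog_slope(hist):
    """Least-squares slope of log(count) against log(degree)."""
    points = [(math.log(k), math.log(c)) for k, c in hist.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic histogram that follows count = 1000 * degree^-2 exactly.
synthetic = {k: 1000.0 * k ** -2 for k in range(1, 50)}
print(round(loglog_slope(synthetic), 6))
```

For a distribution of the form count ∝ degree^(-γ), the fitted slope approximates -γ; the synthetic example uses γ = 2.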

The ER network has 5000 nodes and 12500 edges. The corresponding parameters m and n are
5000 and 0.0005. Attacking the ER network (Fig. 5), we find the following.


(1) In Fig. 5a the random attack (H = 0, F = 0) is at the bottom, which shows that the
    random attack is the most effective on the ER network. Moreover, the larger F is, the
    worse the effect.

(2) In Fig. 5b the three curves almost coincide. This indicates that when H is below a
    certain value, F has no influence, because the effect of the asymmetric information
    attack (F > 0) on the ER network is tiny.
Fig. 5. The change of R,, under different H and F in ER network


The results show that the ER network is robust against the attack with F > 0 and that the
random attack is the most effective on the ER network. The ER network is a random network:
its nodes have similar degrees, and every node has the same status. Using the asymmetric
information attack (F > 0) therefore not only fails to improve the attack effect but is
actually counterproductive. In contrast to the BA network, the curves of the attack with
F > 0 lie above that of the random attack, so the targeted attack cannot fit R min[k].

Thus, the effect of the asymmetric information attack on the ER network is opposite to
that on the BA network but the same as that on the Router network. Although the Router
network shares the same degree distribution with the BA network, its structural
vulnerability is closer to that of the ER network. This shows that the real network has a
certain randomness.


4 Conclusion 


In this article, through an analysis of the random attack and the targeted attack, we
proposed an asymmetric information attack model, of which the random attack and the
targeted attack are special cases. Using node degree as the indicator, we ran simulations
on the ER, BA and Router networks and analyzed the relationship among the random attack,
the targeted attack and the asymmetric information attack. The results reveal that
different networks have different structural vulnerability. The simulations show that the
BA network is fragile under the asymmetric information attack (F > 0), whereas the ER
network is robust to it. Because the real network is large and has a certain randomness,
it is robust to the asymmetric information attack (F > 0) when the attack uses degree
centrality to measure the importance of nodes.
Abstract. With the rapid growth of multimedia data, cross-modal retrieval has
received great attention. Generally, learning semantic correlation is the primary
solution for eliminating the heterogeneous gap between modalities. Existing
approaches usually focus on modeling cross-modal correlation and category
correlation, which cannot capture semantic correlation thoroughly for social
multimedia data. In fact, the diverse link information is complementary and
provides rich hints for semantic correlation. In this paper, we propose a novel
cross-modal correlation learning approach based on subspace learning that takes
heterogeneous social link and content information into account. Both
intra-modal and inter-modal correlation are considered simultaneously by
explicitly modeling link information. Additionally, these correlations are
incorporated into the final representation, which further improves the performance
of cross-modal retrieval. Experimental results demonstrate that the
proposed approach performs better than several state-of-the-art
cross-modal correlation learning approaches.


Keywords: Cross-modal retrieval · Correlation learning · Social multimedia ·
Heterogeneous networks


1 Introduction 


With the rapid development of multimedia technology, there has been a massive
explosion of multimedia data on social media websites, so that traditional social media
increasingly take a multimedia form [1]. In the face of large amounts of complex
social multimedia data, retrieving valuable information is of great significance.
Consequently, cross-modal retrieval attracts considerable attention: users can
input data of any modality at hand to query relevant information in other modalities.
Compared with traditional single-modal retrieval, it is more comprehensive and can
meet increasing user demands. Generally, learning semantic correlation is the main
solution for eliminating the heterogeneous gap between modalities in cross-modal
retrieval. Nevertheless, these data do not exist in isolation in social multimedia, which
makes correlation learning more challenging. On the one hand, different modalities of
multimedia data usually coexist. For example, on an image sharing website,
users usually share images accompanied by some explanatory text. On the other hand,
these multimedia data are closely associated with multiple social factors, such as user,
group and location information, which connect different social components into
heterogeneous social multimedia networks.

Although several efforts have been devoted to correlation learning, most existing
methods focus on modeling cross-modal correlation and category correlation, such as
canonical correlation analysis [2], cross-modal factor analysis [3] and semantic
correlation matching [4]. However, those methods fail to model the correlation thoroughly
for social multimedia data. Much of the social correlation between two objects is not
exploited adequately. For instance, an image and a text may be connected by the same user,
which indicates that they have some semantic correlation. In fact, the complex
heterogeneous link structure is complementary, provides rich hints for semantic
correlation, and can be exploited to bridge the heterogeneous gap to some extent.
Therefore, both link and content information should be captured to improve the retrieval
performance.

In this paper, we propose a novel correlation learning approach based on subspace
learning. It jointly considers heterogeneous links and multimedia content, a combination
ignored by previous works. In this learning framework, multiple social links are first
transformed into both intra-modal and inter-modal correlation via heterogeneous
networks. Then the heterogeneous modalities are projected into a unified subspace
according to those correlations, so that the similarity between different modalities can
be measured through the projection matrices. Experiments on a cross-modal dataset show
that the proposed approach performs better than other prevailing approaches.

Along this line of research, the proposed approach makes two main contributions.
On the one hand, instead of treating each semantic link equally, we design a weight
learning approach that keeps the link structure consistent with the content information.
On the other hand, we propose incorporating heterogeneous link structure and content
information into a unified feature representation, which not only helps to bridge the
heterogeneous gap but is also robust to noise.

The remainder of this paper is organized as follows. The related work on corre- 
lation learning for cross-modal retrieval is reviewed in Sect. 2. Then we present the 
proposed combining link and content correlation learning approach in Sect. 3. In 
Sect. 4, experiments are shown to verify the effectiveness of proposed method. Finally, 
we conclude this paper in Sect. 5. 


2 Related Works 


Learning the semantic correlation of heterogeneous modalities is a considerable
challenge. Generally speaking, existing efforts can be roughly divided into two groups:
one maps the different features into a unified feature space based on subspace
learning, so that the similarity between heterogeneous modalities can be computed, and
the other uses probabilistic modeling to learn a set of shared latent topics for
different modalities.

For subspace learning, there are many representative approaches, such as canonical
correlation analysis, deep canonical correlation analysis [5], and partial least squares
[6]. Those methods usually maximize the correlation between two different modalities and
learn a joint feature representation. However, most of them do not take high-level
semantic correlation into account. To consider label information, Rasiwasia et al.
propose combining subspace and semantic modeling to improve retrieval accuracy.
Huang and Peng [7] exploit fine-grained correlation by taking the entity level
into account to construct high-level concepts. Moreover, a further study jointly models
labeled and unlabeled information with graph regularization in a
semi-supervised learning framework [8]. As for probabilistic models, Jia et al. [9]
proposed a model that can be seen as a Markov random field of topic models, which
establishes correlation among modalities based on their similarity. Han and Thomas [10]
proposed matching images and sounds with tags, which are treated as shared latent
variables, by combining the Latent Dirichlet Allocation and Correspondence Latent
Dirichlet Allocation models. However, these methods often make the strict assumption that
the same topic proportions or pairwise topic correspondences exist between different
modalities, which does not hold for social multimedia data.

Generally, there is an important principle for correlation learning: both
inter-modal and intra-modal correlation should be preserved [11]. That is, if two
objects are closely related in the original space, they should be close to each other in
the latent subspace. The coexistence information is usually taken as inter-modal
correlation, and the intra-modal correlation is usually provided by high-level category
information. In this paper, we transform the rich link structure into intra-modal and
inter-modal correlation, which can further improve the performance of semantic
correlation.


3 Combining Link and Content Correlation Learning 


3.1 Overview of Proposed Framework 


To begin with, the problem formulation is defined in this section. We are given a training
dataset D = {X, Z}, in which $X = \{x_1, x_2, \ldots, x_n\}$ denotes images and
$Z = \{z_1, z_2, \ldots, z_n\}$ denotes texts. $X = [x_1, x_2, \ldots, x_n] \in
\mathbb{R}^{d_1 \times n}$ and $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{d_2 \times n}$
represent the image and text feature matrices, respectively. Here the number of training
samples is $n$, and the feature dimensionalities of image and text are $d_1$ and $d_2$,
respectively. The main goal of correlation learning is to learn two projection matrices
$U$ and $V$, with $d_1$ and $d_2$ rows respectively and the same number of columns, for
the image and text domains, so that the similarity between projected objects
$\mathrm{Sim}(U^{T}x_i, V^{T}z_j)$ can be computed. In other words, the semantic
correlation among heterogeneous modalities is built in a subspace of common
dimensionality, so that cross-modal retrieval can be performed.
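To make the retrieval step concrete, the following sketch projects an image feature vector and several text feature vectors into the common subspace and ranks the texts by similarity. The paper does not specify the form of Sim, so cosine similarity is used here as an assumption; the helper names and the toy matrices are hypothetical.

```python
import math


def project(matrix, vector):
    """Compute matrix^T * vector; `matrix` is a list of rows (d x c)."""
    cols = len(matrix[0])
    return [sum(matrix[r][c] * vector[r] for r in range(len(vector)))
            for c in range(cols)]


def cosine(a, b):
    """Cosine similarity, assumed here as the Sim function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_texts(U, V, image, texts):
    """Return text indices sorted from most to least similar to the image."""
    query = project(U, image)
    scores = [(cosine(query, project(V, z)), idx) for idx, z in enumerate(texts)]
    return [idx for _, idx in sorted(scores, reverse=True)]


# Toy example: d1 = 3, d2 = 2, common subspace of dimension 2.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
image = [1.0, 0.0, 5.0]
texts = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_texts(U, V, image, texts))  # [1, 2, 0]
```

The image projects to (1, 0), so the text at index 1 is the closest match, followed by index 2 and then index 0.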

Next, the learning framework of the proposed approach is introduced briefly, as
illustrated in Fig. 1. The approach consists of two main phases. In the first phase, the
heterogeneous social multimedia network is treated as a kind of heterogeneous
information network [12], in which link-based similarity can be measured by meta-paths.
The heterogeneous network is then transformed into multiple homogeneous networks and
bipartite networks, so that the link-based similarity relationship is encoded into
intra-modal and inter-modal correlation. In the second phase, projection matrices are
learned for each modality according to those correlations, based on subspace learning.
For convenience, we take two modalities as an example in this framework, although it can
easily be extended to the multi-modality case. Based on this framework, the details of
each step are described as follows.

Fig. 1. The framework of the proposed method.


3.2 Learning Link-Based Correlation 


As mentioned above, the similarity relationship should be preserved when learning
semantic correlation. Nevertheless, an existing study [13] has shown that content-based
similarity is sometimes unreliable for determining the similarity between two objects.
Thus, relying only on content-based correlation may lead to unsatisfying results for
correlation learning. Of course, link-based correlation is also uncertain and contingent.
Intuitively, link and content are complementary to each other, so combining them should
achieve more robust performance. In this paper, we embed link-based correlation into
content-based correlation to obtain a more effective yet efficient feature
representation.

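It is reasonable to assume that a semantic relationship exists between two objects if
there is an explicit link between them. Given a heterogeneous network, there are many
semantic links connecting two objects, which are defined as meta-paths. Different
meta-paths imply different semantic meanings. For example, if two images are uploaded by
the same user, they are connected via the path
$\text{image} \xrightarrow{\text{upload}^{-1}} \text{user} \xrightarrow{\text{upload}} \text{image}$,
so it can be assumed that the two images are partially similar through the common user.
Besides, there are many other meta-paths, such as
$\text{image} \xrightarrow{\text{favor}^{-1}} \text{user} \xrightarrow{\text{favor}} \text{image}$,
$\text{image} \xrightarrow{\text{upload}^{-1}} \text{user} \xrightarrow{\text{contact}} \text{user} \xrightarrow{\text{upload}} \text{image}$,
$\text{image} \xrightarrow{\text{locate}} \text{location} \xrightarrow{\text{locate}^{-1}} \text{image}$, and
$\text{image} \xrightarrow{\text{upload}^{-1}} \text{user} \xrightarrow{\text{belong}} \text{group} \xrightarrow{\text{belong}^{-1}} \text{user} \xrightarrow{\text{upload}} \text{image}$.
The link-based similarity can be computed on a meta-path with PathSim [14], whose
similarity measure is defined as follows,

$$s(x_i, x_j) = \frac{2\,\left|\{p_{x_i \rightsquigarrow x_j} : p \in P\}\right|}{\left|\{p_{x_i \rightsquigarrow x_i} : p \in P\}\right| + \left|\{p_{x_j \rightsquigarrow x_j} : p \in P\}\right|} \tag{1}$$

where $x_i$ and $x_j$ are of the same type, $P$ is a meta-path,
$|\{p_{x_i \rightsquigarrow x_j}\}|$ is the number of path instances between $x_i$ and
$x_j$, $|\{p_{x_i \rightsquigarrow x_i}\}|$ is that between $x_i$ and $x_i$, and
$|\{p_{x_j \rightsquigarrow x_j}\}|$ is that between $x_j$ and $x_j$.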
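The PathSim measure can be illustrated with a small sketch for a symmetric two-step meta-path such as image, user, image. It assumes a binary image-by-user adjacency matrix, so that the product of this matrix with its transpose counts the path instances between two images; the function names and the toy data are hypothetical.

```python
def commuting_matrix(adjacency):
    """Return M = A * A^T for a symmetric two-step meta-path.

    adjacency[i][u] is 1 when image i is linked to user u, else 0, so
    M[i][j] counts the path instances between images i and j.
    """
    n = len(adjacency)
    return [[sum(a * b for a, b in zip(adjacency[i], adjacency[j]))
             for j in range(n)] for i in range(n)]


def pathsim(M, i, j):
    """PathSim score: twice the cross count over the sum of the self counts."""
    denom = M[i][i] + M[j][j]
    return 2.0 * M[i][j] / denom if denom else 0.0


# Three images and three users: image 0 is linked to users 0 and 1,
# image 1 to user 0, image 2 to user 2.
A = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
M = commuting_matrix(A)
print(pathsim(M, 0, 1))  # images 0 and 1 share one user
print(pathsim(M, 0, 2))  # no shared user: 0.0
```

Images 0 and 1 have one path instance between them and self counts of 2 and 1, so their score is 2·1/(2+1) = 2/3.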

Different meta-paths reflect different aspects of image or text similarity. Instead of
treating each meta-path equally, we design a novel algorithm to learn different weights.
The underlying hypothesis is that the link-based similarity $s(x_i, x_j)$ is consistent
with the content-based similarity $c(x_i, x_j)$. The set of meta-paths is denoted as
$P = [P_1, P_2, \ldots, P_m]$ and the weights of the different meta-paths are denoted as
$w = [w_1, w_2, \ldots, w_m]$. We minimize the following objective function to learn the
weights,

$$L(w) = \sum_{i=1}^{n}\sum_{j=1}^{n}\Big(c(x_i, x_j) - \sum_{d=1}^{m} w_d\, s^{P_d}(x_i, x_j)\Big)^{2} + \alpha\|w\|^{2} \tag{2}$$

where the first term ensures that the link-based similarity is consistent with the
content-based similarity, and the second term is an L2 regularization. Note that these
similarities are normalized to [0, 1]. The above objective function can be solved by many
off-the-shelf methods, such as Newton's method or stochastic gradient descent.
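A minimal sketch of the weight learning step is given below, using plain batch gradient descent on the objective in Eq. (2). The flattened pair representation, the step size, the iteration count and the function name are assumptions made for illustration; the paper only states that off-the-shelf solvers can be used.

```python
def learn_weights(content, link_sims, alpha=0.01, lr=0.05, steps=2000):
    """Minimise sum_p (c_p - sum_d w_d * s_dp)^2 + alpha * ||w||^2.

    content   -- content-based similarity for each object pair, in [0, 1]
    link_sims -- one list per meta-path with the link-based similarity
                 of the same pairs, in [0, 1]
    """
    m = len(link_sims)
    pairs = len(content)
    w = [1.0 / m] * m                      # start from equal weights
    for _ in range(steps):
        residual = [content[p] - sum(w[d] * link_sims[d][p] for d in range(m))
                    for p in range(pairs)]
        grad = [-2.0 * sum(residual[p] * link_sims[d][p] for p in range(pairs))
                + 2.0 * alpha * w[d] for d in range(m)]
        # Dividing by the number of pairs keeps the step size independent
        # of how many pairs are used.
        w = [w[d] - lr * grad[d] / pairs for d in range(m)]
    return w


# Toy data: the first meta-path reproduces the content similarity exactly,
# the second does not, so the first should receive nearly all the weight.
content = [0.9, 0.1, 0.5, 0.3]
link_sims = [[0.9, 0.1, 0.5, 0.3],
             [0.1, 0.9, 0.5, 0.7]]
print([round(x, 3) for x in learn_weights(content, link_sims, alpha=0.0)])
```

With the regularization switched off, the weights converge to approximately (1, 0) on this toy data; a positive alpha shrinks them slightly toward zero.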

Once the weights of all meta-paths are obtained, the link-based similarity can be
computed by combining the similarities based on the different meta-paths.
Therefore, link-based correlation can be modeled by three weight matrices, of which
each element $w_{ij}$ is expressed as below,

$$w_{ij} = \begin{cases} \sum_{d=1}^{m} w_d\, s^{P_d}(i, j) & \text{if } i \text{ and } j \text{ are of the same type} \\ c_{ij} & \text{otherwise} \end{cases} \tag{3}$$

where $c_{ij}$ denotes whether the link exists or not. In this paper, if the meta-path
$\text{image} \xrightarrow{\text{coexist}} \text{text}$ exists between $i$ and $j$, we set
$c_{ij} = 1$; otherwise we set $c_{ij} = 0$. Furthermore, multi-modal fusion is performed
to obtain a common $w_{ij}$ for the image and text domains in this research.


3.3 Learning Cross-Modal Correlation 


So far, we have encoded the complex link information into both inter-modal and
intra-modal correlation. Our ultimate target is cross-modal retrieval, so we
embed link-based correlation into a content-based unified representation to learn
semantic correlation in the common subspace.

Essentially, similar objects should have similar feature representations: the higher
the correlation between two objects, the closer they should be to each other in the
joint subspace. As mentioned above, to obtain better semantic correlation and a more
robust representation, both intra-modal and inter-modal similarity relationships should
be preserved when learning the common subspace. Hence, the loss function is formed as
follows,

$$\min_{U,V} \sum_{i \in X}\sum_{j \in Z} w_{ij}\,\|U^{T}x_i - V^{T}z_j\|^{2} + \lambda \sum_{i,j \in X \text{ or } Z} w_{ij}\left(\|U^{T}x_i - U^{T}x_j\|^{2} + \|V^{T}z_i - V^{T}z_j\|^{2}\right) \tag{4}$$

where the first term is the distance between different modalities in the projected space,
the second term is that within the same modality, and $\lambda$ is the parameter that
balances the inter-modal and intra-modal correlation.

However, the above projected space has no obvious semantic meaning. In fact, the
semantic label information can offer useful guidance. Motivated by this, we also add a
label-consistency term to the above loss function. Our ultimate objective function is
therefore defined as follows,

$$\begin{aligned} \min_{U,V}\; & \sum_{i \in X}\sum_{j \in Z} w_{ij}\,\|U^{T}x_i - V^{T}z_j\|^{2} + \lambda \sum_{i,j \in X \text{ or } Z} w_{ij}\left(\|U^{T}x_i - U^{T}x_j\|^{2} + \|V^{T}z_i - V^{T}z_j\|^{2}\right) \\ & + \beta\left(\|U^{T}X - Y_X\|_F^{2} + \|V^{T}Z - Y_Z\|_F^{2}\right) + \mu\left(\|U\|_{2,1} + \|V\|_{2,1}\right) \end{aligned} \tag{5}$$

Here, $Y_X$ and $Y_Z$ are the initial label matrices of the image and text domains,
respectively. $\|\cdot\|_{2,1}$ denotes the $\ell_{2,1}$-norm, which is used for sparse
feature selection on the projection matrices. $\beta$ and $\mu$ are the balancing
parameters.
To solve the objective function in Eq. (5), we first simplify it. Inspired by the
Laplacian regularizer, we arrive at the corresponding compact form in Eq. (6),

$$\min_{U,V}\; \|U^{T}X - V^{T}Z\|_F^{2} + \lambda\,\mathrm{tr}\!\left(F^{T}(D - W)F\right) + \beta\left(\|U^{T}X - Y_X\|_F^{2} + \|V^{T}Z - Y_Z\|_F^{2}\right) + \mu\left(\|U\|_{2,1} + \|V\|_{2,1}\right) \tag{6}$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $D$ is a diagonal matrix with
$d(i,i) = \sum_{j} w_{ij}$, and $F = (X^{T}U, Z^{T}V)$. Note that the parameter $\lambda$
has been changed to $2\lambda$; since this does not affect the result, we still write
$\lambda$ for convenience.

The above objective function can be optimized by an iterative algorithm that
solves for one variable while keeping the others fixed. Hence, the solutions for U and V
in each iteration can be expressed by Eqs. (7) and (8),

$$U = \left((1 + \beta)XX^{T} + \mu R_u + \lambda X(D - W)X^{T}\right)^{-1}\left(\beta X Y_X^{T} + XZ^{T}V\right) \tag{7}$$

$$V = \left((1 + \beta)ZZ^{T} + \mu R_v + \lambda Z(D - W)Z^{T}\right)^{-1}\left(\beta Z Y_Z^{T} + ZX^{T}U\right) \tag{8}$$
where $R_u$ is a diagonal matrix [15] whose elements are defined in Eq. (9), and $R_v$
has a representation analogous to $R_u$, as in Eq. (10),

$$r_u(i,i) = \frac{1}{2\sqrt{\|U_i\|_2^{2} + \varepsilon}} \tag{9}$$

$$r_v(i,i) = \frac{1}{2\sqrt{\|V_i\|_2^{2} + \varepsilon}} \tag{10}$$

in which $U_i$ and $V_i$ denote the $i$-th rows of $U$ and $V$, and $\varepsilon$ is a
smoothing term that we set to a small constant to keep the denominator from being zero.

Once the projection matrices for the heterogeneous low-level features are learned, all
the data can readily be projected into the unified feature space. To perform cross-modal
retrieval, the similarity between the query and items of the other modality can be
computed directly.
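As a brief check of the update for U in Eq. (7), the following derivation sketches the intermediate step. It assumes that V is held fixed, that the common weight matrix W is symmetric and of size n by n so that the Laplacian term contributes $\lambda\,\mathrm{tr}(U^{T}X(D-W)X^{T}U)$ for the image side, and that the $\ell_{2,1}$ term is handled by the usual reweighting with $R_u$ treated as constant within one iteration. These assumptions are ours and are not spelled out in the text.

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial U}\Big[\|U^{T}X - V^{T}Z\|_F^{2}
     + \lambda\,\mathrm{tr}\big(U^{T}X(D-W)X^{T}U\big)
     + \beta\,\|U^{T}X - Y_X\|_F^{2}
     + \mu\,\|U\|_{2,1}\Big] \\
  &= 2X\big(X^{T}U - Z^{T}V\big) + 2\lambda X(D-W)X^{T}U
     + 2\beta X\big(X^{T}U - Y_X^{T}\big) + 2\mu R_u U \\
\Longrightarrow\quad
  & \big((1+\beta)XX^{T} + \mu R_u + \lambda X(D-W)X^{T}\big)\,U
    = \beta X Y_X^{T} + XZ^{T}V .
\end{aligned}
```

Left-multiplying by the inverse of the bracketed matrix gives the closed form for U; the update for V follows by exchanging the roles of X and Z, U and V, and $Y_X$ and $Y_Z$.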


4 Experiments 


In this section, experiments are conducted to verify the effectiveness of our method and
to investigate its performance in comparison with several state-of-the-art methods.


4.1 Dataset 


We perform the cross-modal retrieval experiments on the large-scale
NUS-WIDE dataset [16]. Each image is associated with some tags, which can be
regarded as text information. The dataset contains almost 270,000 images with 5,018
unique tags collected from the Flickr website. We first find the 10 largest classes and
crawl the list of users who upload or favor a given image according to the image ID. Then
we choose the users owning more than 10 images and obtain 6487 unique users,
including 1418 authors and 5069 users who favored images. In addition, we also crawl the
group and contact information for each author.

In the link-based correlation learning stage, in order to capture as much link-based
intra-modal correlation as possible, we choose the 5 kinds of meta-paths for images
and texts mentioned above, whereas we consider only one meta-path,
$\text{image} \xrightarrow{\text{coexist}} \text{text}$, as inter-modal correlation. As
for low-level features, we take 500-dimensional SIFT feature vectors for images and
1000-dimensional tag feature vectors for texts. We randomly select 80% of the image-text
pairs for training and use the remaining 20% for testing.


4.2 Comparison Methods and Evaluation Metrics 


To verify its effectiveness, we compare our method with several state-of-the-art
correlation learning methods in both unsupervised and supervised settings.

(1) Canonical Correlation Analysis (CCA): A traditional unsupervised method that
    maximizes the correlation in the common subspace to learn the correlation
    matching between two heterogeneous features. Note that the latent subspace has no
    obvious semantic meaning.

(2) Semantic Matching (SM): A supervised method that learns the correlation by
    transforming the heterogeneous features into a common semantic subspace. The
    similarity between two objects can then be computed according to the probability of
    their belonging to the same class.

(3) Semantic Correlation Matching (SCM): Another supervised method, which combines CCA
    and SM. It first learns cross-modal correlation by maximizing the correlation
    between the subspaces, and then logistic regression is performed to obtain the
    probability of each object belonging to each class. Both correlation analysis and
    semantic abstraction are considered in this method.


In addition, since the dataset is multi-labeled, we use mean average precision (MAP) to
evaluate the retrieval performance. MAP is a rank-based evaluation metric that is widely
used in multi-label classification. Moreover, precision-recall (PR) curves are also
plotted to investigate the performance of the different approaches.
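The MAP metric can be sketched as follows. The relevance criterion used here, that a retrieved item is relevant when it shares at least one label with the query, is an assumption commonly made for multi-label data and is not stated in the paper; the function names are hypothetical.

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[k] is 1 if the k-th result is relevant."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank      # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(query_labels, ranked_labels):
    """MAP over queries; an item is relevant if it shares a label with the query."""
    aps = []
    for labels, ranked in zip(query_labels, ranked_labels):
        relevance = [1 if set(labels) & set(item) else 0 for item in ranked]
        aps.append(average_precision(relevance))
    return sum(aps) / len(aps) if aps else 0.0


queries = [{"sky"}, {"dog"}]
results = [
    [{"sky", "sea"}, {"cat"}, {"sky"}],   # relevant at ranks 1 and 3
    [{"cat"}, {"dog"}],                   # relevant at rank 2
]
print(round(mean_average_precision(queries, results), 4))
```

The first query has AP = (1/1 + 2/3)/2 and the second has AP = 1/2, so the MAP of the toy example is 2/3.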


4.3 Experimental Result 


In this section, we compare the proposed combining link and content correlation
(CLCC) approach with several prevailing approaches. Figure 2 shows the MAP performance
of the different approaches on two cross-modal retrieval tasks, i.e. image to text
retrieval and text to image retrieval. SCM, which considers both correlation matching
and semantic matching, achieves better performance than CCA and SM. It is also clear
that our approach significantly outperforms the other approaches, which demonstrates its
effectiveness. This is reasonable because we not only exploit the rich link structure to
improve correlation, but also consider semantic projection in the subspace during the
training stage. Note that an approach may perform differently on different tasks, yet
our approach obtains better performance than its counterparts on both retrieval tasks.

Additionally, further analysis of the results is presented in Fig. 3 in terms of
precision-recall curves. SCM obtains higher precision than CCA and SM at all levels of
recall for the text to image retrieval task and at almost all levels of recall for the
image to text retrieval task. The same insight holds for the performance improvements of
our method: comparing the PR curves of CLCC with the others, we can see that the
precision of our method is higher at all levels of recall for both forms of cross-modal
retrieval.

In short, we can conclude that the semantic correlation can be exploited more fully by
combining link and content information for cross-modal retrieval.
Fig. 3. Precision-recall curves of (a) image to text and (b) text to image retrieval task.


5 Conclusion


In this paper, we examined the problem of correlation learning for cross-modal retrieval
in heterogeneous social multimedia networks. We proposed an effective learning
approach that combines link and content information to improve the
performance of semantic correlation in a unified projection subspace. Different from
traditional approaches, we integrate rich link structure to obtain an accurate and robust
representation. Furthermore, the unified space has clear semantic meaning because
semantic label information is embedded. Experiments have verified
that the proposed approach performs better than several prevailing
approaches. In the future, we intend to exploit nonlinear projection to obtain a unified
feature representation.
Abstract. This paper studies gesture interaction on the personal portable
computer, using user-defined methods on different operating objects. Based on
an analysis of the experimental results, we developed a set of gestures for the
personal portable computer that can meet different users’ needs. We divide
touch gestures into three categories: basic gestures, symbol gestures and
combination gestures. Based on this gesture database, developers and designers
can quickly and reasonably design touch gesture interactions that match the
product features.


Keywords: Touch gestures · Interaction design · Personal portable computer


1 Related Researches 


Touch screen technology has been applied to mobile phones for 16 years, and the
exploration of touch gesture interaction has always been a focus of researchers.
“User-Defined Gestures for Surface Computing” [1] by Jacob O. Wobbrock et al. defined
the commonly used and basic gestures for large desktop screens and discussed in depth
the influence of human cognitive behavior on hand gesture interaction. Based on
user-defined design methods in the intuitive interaction domain, scholars abroad have
designed interactive behaviors for touch screen mobile phones and somatosensory
operation devices. Many scholars have also studied the pain points of touch screen
interaction [2-4], such as “low click precision” and “fat finger”. Cedric Foucault et
al. designed a two-handed interaction model, “Spad” [5], to enhance the productivity of
tablets. This model uses the non-dominant hand to activate the functional mode, and the
study compared the “Spad” interactive control mode in a tablet application with the
context components in “Keynote” for completing the same task. The comparison confirmed
that “Spad” completes the task more efficiently without increasing complexity. However,
“Spad” requires all tasks to be divided into four groups of three buttons each, which
brings functional limitations.

To draw a conclusion, researchers have proposed their own design methods and 
solutions to the pain points of mobile phone touch screen gesture interaction. However, 
there is no research on the design of large screen personal portable computer to provide 
a unified guidance for personal portable computer touch screen gesture interaction. 


© Springer International Publishing AG 2018 
2 Experimental Study 


2.1 Experimental Subject 


We invited 12 college students to participate in the experiment. All of them were 
graduate students; 5 were female and the others were male. Their ages ranged from 
20 to 25. Before the formal experiment, none of them had performed prolonged, intense 
hand movements. All of them had used touch screen devices before and were familiar 
with touch operation. The subjects had 3 min to learn the page layout and basic 
operations before the tasks. 


2.2 Experimental Task 


The experiment presented three kinds of common operation tasks: basic operations, 
shape operations and text operations. Different target tasks had different 
preconditions. To facilitate the experiment, the extracted tasks were divided into 
three usage scenarios. To avoid contextual effects caused by different tasks, we 
arranged the order of the tasks in advance. This also avoided interrupting 
continuous operation tasks, such as the Copy and Paste tasks (Table 1). 


2.3 Experimental Equipment 


The device used in the experiment was an iPad Pro (12.9 in.) running iOS 10.0 
with a Smart Keyboard. 


2.4 Experimental Process 


Subjects had 3 min to become familiar with the layout of the user interface. In the 
formal experiment, the task interface in each scene showed a prompt message that did 
not describe any gesture. Subjects had to define gestures themselves to complete the 
corresponding task. The think-aloud method was adopted: the subjects were asked to 
state the main reason for their gestures. After each task, the subjects scored their 
subjective perception of the gestures, including the degree of match between the 
gesture and the target task and the accessibility of the operation. 


2.5 Data Processing 


A total of 564 motion data records were collected in the experiment. Based on these 
data, we summarized the frequency, predictability, consistency and subjective 
evaluation of the gestures. 


2.5.1 Frequency 

We selected the most frequent gesture for each experimental task as the standard and 
removed gestures with poor consistency, which yielded the target gesture set. The 
corresponding gestures and their frequencies are shown in Tables 2, 3 and 4. 


The Research on Touch Gestures Interaction Design 


Table 1. Experimental task operation 

Situation 1 
  Pre condition: None 
  Experimental tasks: Create slide; Create slide with the same format; Create new 
  position to new slide; Create a copy for the slide; Copy slide to the specified 
  location; Move slide; Switch to browse view; Switch to normal view; Previous one; 
  Next one; Play slide in current position; End playing slide; Play slide from 
  beginning; Delete slide; Revoke; Redo; Zoom in this slide; Zoom out this slide; 
  Check page object hierarchy 

Situation 2 
  Experimental tasks: Select single, multiple, all shapes; Copy shape; Paste shape 
  to the specified location; Shear shape; Delete shape; Get into the state of 
  editing; Rotate shape; Change size; Modify fill color; Modify border color; 
  Modify border’s type of line; Combination; Cancel combination; Revision level 

Situation 3 
  Experimental tasks: Select textbox; Active edit status; Move cursor position; 
  Select word; Select phrase; Select line text; Choose the sentence; Select whole 
  segment; Copy; Paste; Shear; Delete; Align (left, middle, right, both sides); 
  Change font 
Table 2. Basic operation gestures (gesture icons not reproduced) 

Task                                                      Operating mode 
Building a new default format slide                       Double-click the blank area 
Specified location building a new default format slide    Double-click the interval area 
Move the slide                                            Press and hold to drag 
Switch to the slide show view                             Five fingers grasping 
On one piece                                              underscore 
The next                                                  On the cross 
The slide start playing from the current position         Double-click 
Exit the slides                                           Double-click 
Delete the slides                                         Hold and slip out of the screen 
Back out                                                  Counterclockwise 
Reform                                                    Clockwise 
Zoom in slide on this page                                Expand 
Zoom out slides on this page                              Shrink 

Frequency (legible values only): 10, 11, 11 
Table 3. Shape operation gestures (gesture icons not reproduced) 

Task                      Operating mode                              Frequency 
Access edit text state    [illegible]                                 [illegible] 
Delete                    Hold and slip out the screen                7 
Rotating                  Compatible with three kinds of rotation     7 
Zoom in shape             [illegible]                                 8 
Reduce the shape          Two fingers pinched                         8 
Copy                      Two fingers drag into the clipboard 
Cut                       One finger into the clipboard 
Paste                     One finger drag [illegible] the clipboard 

Pre conditions legible in the source: “Multiple choice” (Rotating) and “Multiple/…” 
(Copy, Cut, Paste). 



Because frequent tasks such as copy, cut and paste are constrained by the current 
page layout, the page layout framework affects the process of defining gestures. 
Therefore, the final data of these three tasks were excluded from the data 
statistics. 


2.5.2 Predictability 
From the gesture set and the frequency of each gesture, the predictability of the 
gesture set can be calculated: 

$$G = \frac{\sum_{s \in S} |P_s|}{|P|} \cdot 100\%$$ 

In the formula, $G$ stands for predictability, $P$ is the set of all actions defined 
by the users for the tasks, $S$ is the resulting gesture set, and $P_s$ is the set of 
user-defined actions that use gesture $s \in S$. The predictability of each task is 
shown in the following figure (Fig. 1). 
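
As an illustration only, the predictability measure described above can be sketched in 
Python. The sketch assumes that each task maps to the list of gestures proposed by the 
participants and that the final gesture set assigns one gesture to each task; the 
function name, argument names and sample data are hypothetical and not taken from the 
experiment. 

```python
def predictability(proposals, gesture_set):
    """Share (in percent) of all proposed actions that match the gesture set.

    proposals:   dict mapping task -> list of gestures proposed by participants
    gesture_set: dict mapping task -> gesture chosen for that task (assumed layout)
    """
    total = sum(len(gestures) for gestures in proposals.values())
    if total == 0:
        return 0.0
    # |P_s| summed over the set: proposals that coincide with the chosen gesture
    matched = sum(
        sum(1 for g in gestures if g == gesture_set.get(task))
        for task, gestures in proposals.items()
    )
    return 100.0 * matched / total


# Hypothetical data: 8 proposals, 6 of which match the chosen gestures
sample = {
    "delete": ["drag_out", "drag_out", "drag_out", "cross"],
    "undo": ["ccw_circle", "ccw_circle", "shake", "ccw_circle"],
}
chosen = {"delete": "drag_out", "undo": "ccw_circle"}
score = predictability(sample, chosen)
```
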
Table 4. Text operation gestures (gesture icons not reproduced) 

Task           Operating mode                                             Frequency 
[illegible]    Hold and slip out the screen                               5 
[illegible]    Two fingers extended                                       [illegible] 
[illegible]    Two fingers pinched                                        7 
Alignment      Drag and drop two points to target alignment direction     5 
[illegible]    Two fingers drag into the clipboard                        [illegible] 
[illegible]    One finger drag into clipboard                             [illegible] 
[illegible]    One finger drag out of clipboard                           [illegible] 


Fig. 1. Predictable sequence of gestures for each task 
2.5.3 Uniformity 
After calculating the predictability of the gesture set, the consistency of the 
gesture set can be calculated: 

$$A = \frac{\sum_{r \in R} \sum_{P_i \subseteq P_r} \left( \frac{|P_i|}{|P_r|} \right)^2}{|R|} \cdot 100\%$$ 

In this formula, $A$ is the consistency score, $R$ is the set of target tasks, $r$ is 
a target task in $R$, $P_r$ is the set of actions defined by the users for task $r$, 
and $P_i$ is a subset of $P_r$ containing identical actions (Fig. 2). 
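
A minimal sketch of the consistency score follows. It assumes the squared-share form 
of the agreement measure from Wobbrock et al. [1], in which each group of identical 
proposals for a task contributes the square of its share; the function name and the 
sample data are hypothetical. 

```python
from collections import Counter


def consistency(proposals):
    """Mean over tasks of the summed squared shares of identical proposals, in percent.

    proposals: dict mapping task -> list of gestures proposed by participants
    """
    per_task = []
    for gestures in proposals.values():
        n = len(gestures)
        if n == 0:
            continue  # a task without proposals contributes nothing (assumption)
        groups = Counter(gestures)  # each group is one P_i
        per_task.append(sum((size / n) ** 2 for size in groups.values()))
    if not per_task:
        return 0.0
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical data: task "a" splits evenly in two, task "b" is unanimous
score = consistency({"a": ["x", "x", "y", "y"], "b": ["z", "z", "z", "z"]})
```

A task on which all participants agree scores 1, so the measure rewards tasks whose 
proposals concentrate on a few gestures. 
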


Fig. 2. Sequence of gestures for each task (plot title: Uniformity; vertical axis from 0.00 to 0.90) 


2.5.4 User’s Subjective Evaluation 

The matching degree and accessibility scores are shown in Fig. 3. Creating a new 
slide, moving the slide, switching the slide, deleting the slide, slide zoom, 
activating text editing and shape zoom received higher scores, indicating that the 
gestures for these tasks are easily understood and carried out by users at the 
cognitive level. Switching the slide and playing the slides are easy to carry out, 
but their matching degree is low. Through the interviews we learned that this is 
because the subjects were afraid of conflicting with the operating system when 
defining the gestures, which also affected the results (Fig. 3). 
Fig. 3. User’s subjective evaluation of each task 


3 Personal Portable Device Gesture Types 


Summarizing the experimental data further, the gestures that can be used on 
personal portable computers are divided into basic gestures, symbol gestures and 
combined gestures. 


3.1 Basic Gestures 


Basic gestures include click, press, swipe, drag, zoom and rotate. They complete the 
functional operations of common applications and are versatile (Table 5). 


3.2 Symbolic Gestures 


Symbolic gestures follow the user’s experience of daily life and of graphical 
interfaces: they carry handwritten symbols, or icons from the graphical user 
interface, over to the touch screen interface. For example, most users draw a 
counterclockwise circle to represent revoking. The symbol is similar to the revoke 
icon in the graphical interface, and a counterclockwise circle also gives the user a 
psychological hint of time reversal. In this way, gesture definitions can be extended 
to other functions. For instance, the bold, italic and underline text functions can 
be expressed by drawing graphics similar to their icons, using “B”, “I” and “U” to 
manipulate text attributes. Because symbolic gestures rely on the user’s experience 
of daily life and graphical interfaces, the usable symbols are limited and their 
scalability is relatively low (Fig. 4). 
Table 5. Symbolic gestures 

Application contexts (columns): Main screen, In list, Thumbnail, Detailed view, 
Control, Text editing, Picture editing, Form editing 

Click: open / click the corresponding item; adjust the corresponding item / stop the 
  sliding item; activate control function; move cursor position; select 
Double click: scale / switch view; intelligent phrase selection; select; enter input 
  status 
Triple-click: select whole segment 
[gesture name illegible]: corresponding items get into edit status, activate hidden 
  functions 
Single finger drag: mobile shortcut location; delete; move cursor position like 
  fisheye; move position / delete 
Double finger drag: copy 
Horizontal sliding: switch main screen; transfer / delete; switch to [illegible] 
  page; horizontal movement / switch to another; operation control 
Vertical slip: notification bar / toolbar; vertical moving to certain page; moving / 
  switch to sibling page; operation control 
Four finger [illegible]: switching application / view the background running app 
Zoom: return to previous level; scale; two-finger retraction; three- or four-finger 
  switch view 
Five finger zoom: close the application, back to the main screen 
Rotate: rotate picture; rotate 
Fig. 4. Symbolic gesture manipulation (gestures “B”, “I”, “U”) 


3.3 Combination Gestures 


Combination gestures are gestures completed by both hands simultaneously, or two 
different types of gestures completed by one hand. They include three types: 
two-handed combinations of homogeneous anisotropic gestures, two-handed 
combinations of heterogeneous gestures, and one-handed heterogeneous combination 
gestures. 


3.3.1 Hands Combination of Homogeneous Anisotropy Gestures 

Two-handed input requires a high degree of coordination and coherence between the 
user’s hands, and should avoid the challenge of “drawing a circle with one hand and 
a square with the other”. Two-handed input also demands a high degree of 
concentration: if the task is too complicated, the cognitive burden of carrying it 
out increases. So in most cases, two-handed interactive gestures are homogeneous 
anisotropic gestures, in which both hands perform the same kind of gesture in 
different directions, such as “rotation”, “text center alignment” and “text 
alignment on both sides”. 


3.3.2 Hands Heterogeneous Combinations 

Hands heterogeneous combinations refer to one hand completing a static gesture while 
the other hand completes a dynamic gesture. Guiard holds that the operating 
mechanism of the two hands is based on their cooperation and asymmetry [6]. The 
movement of the non-dominant hand is coarser than that of the dominant hand. 
Therefore, the non-dominant hand can be used to activate the operation mode of a 
function, and the dominant hand can be used to operate or make fine adjustments. In 
the experiment, users completed the multi-selection operation through “one finger 
long press, one finger click” or “one finger long press, one finger drag”, and 
completed the “shape rotation” operation through “one finger holds the shape, one 
finger rotates the shape”. Such gestures increase the user’s memory burden for the 
function-activating modifier gestures of the non-dominant hand, but can improve the 
operational performance of expert users. 


3.3.3 One Hand Heterogeneous Combination 

One hand heterogeneous combination refers to a sequence of different gestures 
completed by one hand, such as the continuous touch gestures that Google created for 
Android devices. With Google’s gestures, users can draw the letter “g” and then 
circle the target search term in one continuous stroke, or cover the target term with 
the combined shape of “g” and “o”, to search for the target term quickly. 
4 Conclusion 


At present, the large-screen personal portable computer, as a “productivity tool”, 
has not yet formed a complete and systematic touch-screen operation model in 
people’s understanding. We applied the method put forward by Wobbrock et al. to the 
gesture interaction design of personal portable computers and summarized a gesture 
set for different operating objects of the personal portable computer. We also 
divided personal portable computer gestures into three categories, basic gestures, 
symbol gestures and combined gestures, in accordance with the gesture operation 
model, and summed up the characteristics of the three types. Based on this gesture 
set, designers of personal portable computers can conveniently build gestures that 
conform to the user’s mental model, giving the user a more natural and intuitive 
interaction experience. 
Abstract. To meet the personalized service needs of mobile Internet users, the 
key question is how to retrieve, accurately and in real time, the content that 
mobile users are really interested in from the vast amount of mobile information. 
To obtain more accurate mobile user preferences and meet the requirements of 
personalized services, this paper proposes a new mobile user preference 
prediction method based on trust and link prediction, built on an analysis of 
mobile user behavior. First, this paper proposes a method to calculate the trust 
of mobile users by analyzing their behavior. Then, according to the trust between 
mobile users and the similarity of their ratings, the approximate neighbors of a 
mobile user are selected. Next, we use the link prediction method to calculate the 
correlation between mobile users and mobile network services and determine the 
mobile network services that need to be predicted. Finally, we use this method to 
predict the user’s preference. The research shows that the prediction accuracy of 
this method is better than that of the traditional Collaborative Filtering 
recommendation method, and that it solves the sparsity problem to some extent. 


Keywords: User’s personality feature - Mobile agent system 
Collaborative Filtering (CF) - Trustworthiness - Mobile internet 


1 Introduction 


Due to the rapid development of mobile communication technology and intelligent 
mobile devices, mobile phones have become one of the main platforms for people to 
obtain information. With the development of multimedia technology and of mobile 
information loading and transmission capability, more people use mobile e-commerce 
and mobile Internet service malls, but the huge amount of mobile network service 
information has brought serious information overload problems for mobile users. 
Therefore, how to find what mobile users are really interested in, accurately and in 
real time, has become a difficult problem to be solved in personalized mobile network 
services [1, 2]. Among the many methods for predicting user preferences, the 
collaborative filtering algorithm is the most classic and the most widely used. 
However, when user preference prediction is carried out with the traditional 
collaborative filtering method, the prediction accuracy is relatively low due to the 
sparsity problem. At the same time, due to the limited input and output capability of 
mobile phones and mobile users’ need for real-time access to information, mobile 
network services place higher demands on the accuracy of mobile user preference 
prediction [3]. Traditional user preference prediction methods are therefore not 
suitable for personalized mobile network service systems. 

The basic idea of collaborative filtering is to find the nearest neighbors of the 
target user by calculating the similarity between users, and then to predict the 
target user’s preferences from the preferences of those nearest neighbors. However, 
as the data size grows, data sparsity problems gradually appear, which reduces the 
efficiency and accuracy of the algorithm. Some scholars mitigate this problem by 
integrating trust relationships between users into traditional collaborative 
filtering. In [4], the trust relationship between mobile users is calculated from the 
mobile users’ explicit ratings. However, in the mobile communication network, because 
the mobile phone’s input and output capability is limited, explicit evaluation 
information between mobile users is relatively scarce, so the trust relationship 
among mobile users is generally obtained by implicit calculation. In [2], to suit the 
characteristics of mobile communication networks, the trust relationship between 
mobile users is introduced into traditional collaborative filtering; the trust 
relationship is analyzed through the communication behavior between mobile users, and 
a linear method for calculating the trust of mobile users is given. Reference [5] 
describes an e-mail-based social network computing method, which analyzes the 
communication behavior between users to build a social network and shows 
experimentally that calculating the user trust relationship with a logarithmic 
function is more accurate than with a linear function [6]. 

In addition, some researchers address the sparsity problem from another angle. 
Reference [7] proposes an improved collaborative filtering method based on link 
prediction, which builds a network model connecting user and item nodes; analyzing 
the relation between users and items can effectively reduce the sparsity problem of 
traditional collaborative filtering. In the traditional link prediction method, nodes 
represent only users or only items, and edges represent only relationships among 
users or among items. Reference [7] improves on this: nodes in the network represent 
both users and items, and edges represent relationships between users and items. 
Although [7] reduces the sparsity problem of collaborative filtering through link 
prediction, it does not consider the impact of trust on mobile user preference 
prediction, which matters given the characteristics of mobile communication networks. 

In this paper, we analyze mobile user behavior to calculate the trust degree between 
mobile users, and select the approximate neighbors of the target user by combining 
trust with the similarity of mobile users’ preferences. Then, the improved link 
prediction method is used to calculate the correlation between mobile users and 
mobile network services, and the mobile network services that the target user is most 
likely to use are determined according to the calculated correlation. Finally, based 
on the above research, a mobile user preference prediction method based on trust and 
link prediction is proposed. 


2 Mobile User Preference Prediction Model 


In order to improve the accuracy of mobile user preference forecasting, this paper 
introduces the trust degree between mobile users and the correlation between mobile 
user-mobile network services into the mobile user preference prediction model [8]. 


2.1 Data Model 


A large amount of user communication and interaction information exists in the 
mobile communication network, mainly communication records such as voice calls, 
short messages and Fetion messages, together with some evaluation information about 
mobile network services from mobile users. In this paper, we build the data model on 
these data; it includes four data sets [8]: 


(1) Mobile user set: $U = \{u_i \mid i \in [1, N]\}$, where $N$ is the number of 
mobile subscribers. 

(2) Mobile web services set [9]: $S = \{s_j \mid j \in [1, M]\}$. The mobile network 
services include basic communication services and value-added services, such as 
downloading software, browsing web pages, e-commerce, receiving and sending mail, 
and customizing services. $M$ denotes the number of mobile network services. 

(3) Mobile subscriber service network [10]: The evaluation behavior of the mobile 
users in the set $U$ toward the mobile network services in the set $S$ is denoted 
by $G_{us}\langle V_{us}, E_{us}\rangle$. $V_{us}$ is the set of all mobile users 
and mobile network services that need to be processed, and the edge 
$e\langle u, s\rangle \in E_{us}$ denotes that mobile network service $s$ is used 
by mobile user $u$. The weight $w\langle u, s\rangle$ of the edge is the rating of 
mobile user $u$ for mobile network service $s$. The ratings are extracted from the 
feedback, evaluation and other records of the mobile services used by the mobile 
users, combined with the users’ usage of the related mobile services. 

(4) Mobile user trust network [11]: The trust relationship between mobile users in 
the set $U$ is expressed by $G_{uu}\langle V_{uu}, E_{uu}\rangle$, where $V_{uu}$ 
denotes the set of all mobile communication network users and 
$e\langle u_i, u_j\rangle \in E_{uu}$ indicates that there is a trust relationship 
between mobile users $u_i$ and $u_j$. The weight $w\langle u_i, u_j\rangle$ of the 
edge represents the trust degree of mobile user $u_i$ in mobile user $u_j$. The 
trust degree is obtained by analyzing the communication behavior among the mobile 
users: the more connections there are between $u_i$ and $u_j$, the higher the 
trust between them, and vice versa. 
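
One possible in-memory layout for these four data sets is sketched below. The class 
and method names are hypothetical, and storing each edge weight in a dictionary keyed 
by a node pair is an assumption of this sketch, not a structure given in the paper. 

```python
from dataclasses import dataclass, field


@dataclass
class MobileDataModel:
    users: set = field(default_factory=set)      # U
    services: set = field(default_factory=set)   # S
    ratings: dict = field(default_factory=dict)  # (u, s) -> w<u, s>, user-service network
    trust: dict = field(default_factory=dict)    # (u_i, u_j) -> w<u_i, u_j>, directed

    def add_rating(self, user, service, rating):
        """Add an edge of the user-service network, weighted by the rating."""
        self.users.add(user)
        self.services.add(service)
        self.ratings[(user, service)] = rating

    def add_trust(self, truster, trustee, degree):
        """Add a directed edge of the trust network (u_i trusts u_j)."""
        self.users.update((truster, trustee))
        self.trust[(truster, trustee)] = degree


# Hypothetical records
model = MobileDataModel()
model.add_rating("u1", "mail", 4)
model.add_trust("u1", "u2", 0.6)
```

Keeping the trust dictionary keyed by ordered pairs preserves the asymmetry of trust: 
an entry for ("u1", "u2") implies nothing about ("u2", "u1"). 
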
2.2 Mobile User Trust Calculation 


Combining the characteristics of mobile communication behavior and the transitivity of 
trust, this paper defines direct trust and indirect trust as follows: 


Definition 1 (direct trust): If there is communication between two mobile users and 
the trust between them is greater than the set threshold $\lambda$, then there is a 
direct trust relationship between them. 


Definition 2 (indirect trust): If there is no direct trust relationship between two 
mobile users but they have a common trust neighbor, then there is an indirect trust 
relationship between them. 

The trust among mobile users in the mobile communication network is calculated as 
follows: 


(1) Preprocess the mobile user communication behavior data, deleting incomplete and 
noisy data. 

(2) According to the communication behavior between mobile users, construct the 
initial mobile user trust network $G_{uu1}\langle V, E\rangle$: if there is a call 
or text message between mobile users $u_i$ and $u_j$, then $u_i$ and $u_j$ are 
connected. Because the degree of trust between users is asymmetric, 
$G_{uu1}\langle V, E\rangle$ is a directed weighted graph [11]. 

(3) Calculate the direct trust between mobile users. As the amount of traffic 
increases, the trust degree between mobile users tends to become stable, which is 
consistent with the theory of diminishing marginal utility. Therefore, this paper 
uses a logarithmic function to express the relationship between them. For example, 
the communication record of mobile user $u_i$ mainly consists of text message and 
call statistics. First, the total number of messages $p$ and the number of messages 
$p_{u_j}$ for which mobile user $u_j$ is the receiving party are counted. Then the 
total call time $q$ and the total number of calls $r$ of user $u_i$ are counted, 
together with the total call time $q_{u_j}$ and the total number of calls $r_{u_j}$ 
for which mobile user $u_j$ is the called party. Define $T(u_i, u_j)$ as the trust 
of $u_i$ in mobile user $u_j$, and $T(m_{u_j}, m)$ as the trust of $u_i$ in $u_j$ 
for one communication behavior, where $m_{u_j} \in \{p_{u_j}, q_{u_j}, r_{u_j}\}$ 
and $m \in \{p, q, r\}$. Then $T(m_{u_j}, m)$ can be expressed as 

$$T(m_{u_j}, m) = \frac{2\,(1 - 1/\ln m_{u_j})\,(1 - 1/\ln(m - m_{u_j}))}{2 - 1/\ln m_{u_j} - 1/\ln(m - m_{u_j})} \tag{1}$$ 

(4) The direct trust between mobile users can be expressed as 

$$T(u_i, u_j) = T(p_{u_j}, p) + T(q_{u_j}, q) + T(r_{u_j}, r) \tag{2}$$ 

So, in $G_{uu1}\langle V, E\rangle$, the weight of the edge 
$e\langle u_i, u_j\rangle$ is $T(u_i, u_j)$. 

(5) Repeat step (3) for each user node in $U$ to form $G_{uu2}\langle V, E\rangle$. 

(6) Calculate the indirect trust between mobile users. Following the MoleTrust 
algorithm proposed in [8], we introduce an attenuation factor into the trust 
calculation, on the assumption that trust declines as it is transmitted. According 
to Definition 2, because there may be multiple common trust neighbors between two 
mobile user nodes, the indirect trust degree is calculated by trust synthesis [12]. 
Let the attenuation factor in the transmission process be $D$. Taking the mobile 
user nodes $u_i$ and $u_k$ as an example, if node $u_j$ is a common trust neighbor, 
the indirect trust degree of $u_i$ in $u_k$ through $u_j$ can be expressed as 

$$T_j(u_i, u_k) = D \times \min\big(T(u_i, u_j),\, T(u_j, u_k)\big) \tag{3}$$ 

$T_j(u_i, u_k)$ is calculated for each common trust neighbor $u_j$ of $u_i$ and 
$u_k$, and the indirect trust degree of $u_i$ in $u_k$ is finally obtained as 

$$T(u_i, u_k) = \frac{\sum_{j} T_j(u_i, u_k)}{L} \tag{4}$$ 

where $L$ is the total number of common trust neighbors of $u_i$ and $u_k$. 

(7) Carry out step (6) for each pair of mobile user nodes in 
$G_{uu2}\langle V, E\rangle$ that satisfies Definition 2, forming the complete 
mobile user trust network $G_{uu3}\langle V, E\rangle$. 
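
The trust calculation can be sketched as follows, with several assumptions that are 
not stated in the paper: the per-behavior trust is read as the harmonic mean of 
$1 - 1/\ln m_{u_j}$ and $1 - 1/\ln(m - m_{u_j})$; counts for which a logarithm is 
undefined or zero yield a trust of 0; the three behaviors are summed without further 
normalization; and the trust dictionary already contains only direct trust edges above 
the threshold. All function names are hypothetical. 

```python
import math


def behaviour_trust(m_uj, m):
    """Trust from one behavior: m_uj events involve u_j out of m events in total."""
    rest = m - m_uj
    if m_uj <= 1 or rest <= 1:
        return 0.0  # guard chosen for this sketch: ln is undefined or zero here
    a = 1 - 1 / math.log(m_uj)
    b = 1 - 1 / math.log(rest)
    if a + b == 0:
        return 0.0  # avoid division by zero (assumption)
    return 2 * a * b / (a + b)  # harmonic mean of a and b


def direct_trust(counts_uj, totals):
    """Sum the per-behavior trust over messages (p), call time (q) and calls (r)."""
    return sum(behaviour_trust(counts_uj[k], totals[k]) for k in ("p", "q", "r"))


def indirect_trust(trust, ui, uk, decay):
    """Average, over common trust neighbors u_j, of decay * min(T(ui,uj), T(uj,uk))."""
    paths = [
        decay * min(t, trust[(uj, uk)])
        for (src, uj), t in trust.items()
        if src == ui and uj not in (ui, uk) and (uj, uk) in trust
    ]
    return sum(paths) / len(paths) if paths else 0.0


# Hypothetical direct trust edges: a reaches c through b and through d
edges = {("a", "b"): 0.8, ("b", "c"): 0.6, ("a", "d"): 0.4, ("d", "c"): 0.9}
t_ac = indirect_trust(edges, "a", "c", decay=0.5)
```

Taking the minimum along each path means an indirect relationship is never stronger 
than its weakest link, and the decay factor lowers it further. 
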


2.3 User - Service Dependency Calculations Based on Link Forecasting 


Traditional collaborative filtering mainly predicts user preferences by calculating 
the similarity between users or between items, and gives less consideration to the 
relationship between users and items. In [7], the data model established by link 
prediction captures the relationship between the two, and an improved link prediction 
method is given. Jaccard’s coefficient is one of the commonly used measures in link 
prediction [7], but it does not apply directly to the calculation of user-service 
relevance in the link prediction model, so this paper redefines it and gives the 
user-service correlation formula 

$$corr(u, s) = \frac{|\Gamma(u) \cap \hat{\Gamma}(s)|}{|\Gamma(u) \cup \hat{\Gamma}(s)|} \tag{5}$$ 

where $\Gamma(u)$ represents the set of neighbor nodes of node $u$, $\Gamma(s)$ 
represents the set of neighbor nodes of node $s$, and 
$\hat{\Gamma}(s) = \bigcup_{u' \in \Gamma(s)} \Gamma(u')$. 
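
A minimal sketch of this correlation on a bipartite user-service graph is given below. 
It assumes that the extended neighborhood of a service is the union of the services 
used by every user of that service, and it represents the graph as two dictionaries of 
sets; the function and variable names are hypothetical. 

```python
def correlation(user, service, user_services, service_users):
    """Jaccard-style overlap between a user's services and the service's extended set.

    user_services: dict user -> set of services the user has used
    service_users: dict service -> set of users who have used the service
    """
    gamma_u = user_services.get(user, set())
    # Extended neighborhood: services used by any user who used `service`
    gamma_hat_s = set()
    for other in service_users.get(service, set()):
        gamma_hat_s |= user_services.get(other, set())
    union = gamma_u | gamma_hat_s
    if not union:
        return 0.0
    return len(gamma_u & gamma_hat_s) / len(union)


# Hypothetical usage records
used = {"u1": {"s1", "s2"}, "u2": {"s2", "s3"}, "u3": {"s3"}}
users_of = {"s1": {"u1"}, "s2": {"u1", "u2"}, "s3": {"u2", "u3"}}
c = correlation("u1", "s3", used, users_of)
```

Here u1 has never used s3, yet the score is positive because u1 shares service s2 with 
a user of s3, which is how the measure eases sparsity. 
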


2.4 Mobile User Similarity Calculation 


In this paper, the Pearson correlation coefficient is used to calculate the similarity 
between mobile users [13]. The formula is expressed as 

$$sim(u_i, u_j) = \begin{cases} \dfrac{\sum_{s \in S_{u_i u_j}} (r_{u_i,s} - \bar{r}_{u_i})(r_{u_j,s} - \bar{r}_{u_j})}{\sqrt{\sum_{s \in S_{u_i u_j}} (r_{u_i,s} - \bar{r}_{u_i})^2}\,\sqrt{\sum_{s \in S_{u_i u_j}} (r_{u_j,s} - \bar{r}_{u_j})^2}}, & |S_{u_i u_j}| \ge t \times \min(|S_{u_i}|, |S_{u_j}|) \\ 0, & |S_{u_i u_j}| < t \times \min(|S_{u_i}|, |S_{u_j}|) \end{cases} \tag{6}$$ 

where $S_{u_i}$ and $S_{u_j}$ represent the sets of services rated by mobile users 
$u_i$ and $u_j$ respectively, $S_{u_i u_j}$ represents the set of services rated by 
both $u_i$ and $u_j$, $r_{u_i,s}$ and $r_{u_j,s}$ represent the ratings of mobile 
users $u_i$ and $u_j$ for service $s$, and $\bar{r}_{u_i}$ represents the average 
rating of mobile user $u_i$ over the services the user has rated. 
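
The thresholded Pearson similarity can be sketched as follows. The sketch assumes that 
each user's mean is taken over all services the user has rated, that the first case 
applies when the number of co-rated services reaches the threshold, and that a zero 
denominator yields 0; the function name and the dictionary layout are hypothetical. 

```python
import math


def similarity(ratings_i, ratings_j, t=0.0):
    """Pearson similarity over co-rated services, or 0 if too few are co-rated.

    ratings_i, ratings_j: dict service -> rating for the two users
    t: overlap coefficient applied to the smaller of the two rating sets
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common or len(common) < t * min(len(ratings_i), len(ratings_j)):
        return 0.0
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    num = sum((ratings_i[s] - mean_i) * (ratings_j[s] - mean_j) for s in common)
    den_i = math.sqrt(sum((ratings_i[s] - mean_i) ** 2 for s in common))
    den_j = math.sqrt(sum((ratings_j[s] - mean_j) ** 2 for s in common))
    if den_i == 0 or den_j == 0:
        return 0.0  # constant ratings: correlation undefined, treated as 0 here
    return num / (den_i * den_j)


# Hypothetical ratings: the second user rates everything twice as high
s_same = similarity({"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 4, "c": 6})
```
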


2.5 Mobile User Preference Prediction 


To improve the prediction of mobile user preferences [8, 11], this paper improves the 
traditional collaborative filtering method, which suffers from low accuracy and data 
sparseness. When searching for the approximate neighbors of the target user, the 
trust between mobile users and the similarity of their preferences are fused into the 
approximate neighbor weight. The link prediction method is then used to determine the 
mobile network services that the mobile user is most likely to use. Finally, an 
improved preference prediction formula is used to predict the mobile user’s 
preferences. The specific method is as follows: 


(1) Determine the approximate neighbor set of the mobile user 


In the process of mobile user preference prediction, this paper not only considers 
the impact of similarity between mobile user preferences on forecasting results, but also 
considers the impact of mobile users’ trust on mobile user preferences and the har- 
monic neighbor is used to select the approximate neighbor set of the target user. 
Reconciliation weights are determined by the degree of similarity of scoring and the 
degree of trust of the user. When the trust between mobile users is relatively large (for 
example, mobile users u; and mobile users u; is a good friend), the trust of the weight 
value is relatively large; When the trust between mobile users is relatively small (for 
example, mobile users u; and mobile users u; is strangers), the similarity of the weight 
value is relatively large. So the formula for the harmonic weight is expressed as 


$$W(u_i,u_j)=\begin{cases}\alpha_1\,\mathrm{sim}(u_i,u_j)+\beta_1\,T(u_i,u_j), & T(u_i,u_j) > \lambda \\ \alpha_2\,\mathrm{sim}(u_i,u_j)+\beta_2\,T(u_i,u_j), & \text{otherwise}\end{cases} \qquad (7)$$


where $W(u_i, u_j)$ represents the harmonic weight of mobile users $u_i$ and $u_j$; 
$\mathrm{sim}(u_i, u_j)$ represents the similarity of mobile users $u_i$ and $u_j$; 
$T(u_i, u_j)$ represents the trust between mobile users $u_i$ and $u_j$; 
$\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$ are the harmonic factors, with $\alpha_1 + \beta_1 = 1$ and 
$\alpha_2 + \beta_2 = 1$; and $\lambda$ is the trust threshold. 
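
As a minimal sketch of the piecewise harmonic weight described above, the following Python function blends similarity and trust with one pair of factors when trust exceeds the threshold and with another pair otherwise. The function name, the argument names and the default values are illustrative assumptions: the defaults reuse the tuned factors and the trust threshold of 0.6 reported later in the experiments, and treating a trust value exactly equal to the threshold as the low-trust case is also an assumption.

```python
def harmonic_weight(sim, trust, lam=0.6,
                    alpha1=0.38, beta1=0.62,
                    alpha2=0.67, beta2=0.33):
    """Blend rating similarity and trust into one neighbor weight.

    sim, trust: values in [0, 1] for a pair of users.
    lam: trust threshold (assumed default).
    (alpha1, beta1): factors used when trust > lam.
    (alpha2, beta2): factors used otherwise.
    """
    # Each pair of harmonic factors must sum to 1.
    for a, b in ((alpha1, beta1), (alpha2, beta2)):
        if abs(a + b - 1.0) > 1e-9:
            raise ValueError("harmonic factors must sum to 1")
    if trust > lam:
        return alpha1 * sim + beta1 * trust
    return alpha2 * sim + beta2 * trust


if __name__ == "__main__":
    print(harmonic_weight(0.5, 0.8))  # high trust: trust dominates
    print(harmonic_weight(0.5, 0.2))  # low trust: similarity dominates
```

With these illustrative defaults, a trusted pair is weighted mostly by trust and an untrusted pair mostly by rating similarity, which matches the intent stated in the text.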

When determining the approximate neighbor set of a mobile user, the top t% of users 
ranked by harmonic weight are selected as the approximate neighbor set of the target 
user. When t = 0 the approximate neighbor set is empty, and when t = 100 it includes all 
the mobile users in the data set, so the desired result cannot be obtained when the 
value of t is too small or too large. In this paper, a reasonable value of t is selected 
on the basis of repeated experiments. 
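
The selection of the top t% of candidates can be sketched as follows. This is an illustration only: the function name, the dictionary input and the use of rounding down when t% of the candidates is not a whole number are assumptions, since the text does not say how fractional counts are handled.

```python
import math


def approximate_neighbors(weights, t):
    """Return the top t% of candidate users by harmonic weight.

    weights: dict mapping candidate user id -> harmonic weight
             with respect to the target user.
    t: percentage in [0, 100].
    """
    if not 0 <= t <= 100:
        raise ValueError("t must lie in [0, 100]")
    # Assumption: round the neighbor count down.
    count = math.floor(len(weights) * t / 100)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:count]


if __name__ == "__main__":
    w = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}
    print(approximate_neighbors(w, 50))
```

The two boundary cases in the text fall out directly: t = 0 yields an empty list and t = 100 returns every candidate.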


(2) Identify the target mobile network service set that needs to be predicted 


According to [14], the higher the correlation between a mobile user and a mobile 
network service, the more likely the mobile user is to use that service in the future. 
Therefore, when determining the target set of mobile network services to be predicted, 
the top h% of mobile network services ranked by relevance are selected as the set of 
services that the target user is most likely to use. 


(3) Predict mobile user preferences 


Combining the harmonic weight with the traditional preference prediction formula, 
the weighted average prediction formula [15] is 


$$r_{u_i,s} = \bar{r}_{u_i} + k \sum_{u_j \in N_{u_i}} W(u_i,u_j)\,(r_{u_j,s} - \bar{r}_{u_j}), \quad s \in S_{u_i} \qquad (8)$$


where $N_{u_i}$ is the nearest-neighbor set of mobile user $u_i$; $S_{u_i}$ is the set of 
mobile network services to be predicted for mobile user $u_i$; and $k$ is a normalization factor, 


$$k = \frac{1}{\sum_{u_j \in N_{u_i}} W(u_i,u_j)}.$$
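
The prediction step can be sketched in Python under the assumption that Eq. (8) has the usual mean-centred weighted-average form: the target user's average rating plus the weight-normalized sum of each neighbor's deviation from that neighbor's own average. The function name and the tuple layout are hypothetical, and falling back to the target user's average when the weights sum to zero is an added safeguard, not something stated in the text.

```python
def predict_rating(target_mean, neighbors):
    """Predict the target user's rating for one service.

    target_mean: average rating of the target user.
    neighbors: iterable of (weight, neighbor_rating, neighbor_mean)
               for neighbors that rated the service.
    """
    neighbors = list(neighbors)
    total_weight = sum(w for w, _, _ in neighbors)
    if total_weight == 0:
        # Assumed fallback: no usable neighbor information.
        return target_mean
    k = 1.0 / total_weight  # normalization factor
    deviation = sum(w * (r - m) for w, r, m in neighbors)
    return target_mean + k * deviation


if __name__ == "__main__":
    # Two neighbors: one rated above its mean, one below.
    print(predict_rating(3.0, [(0.8, 5.0, 4.0), (0.2, 2.0, 3.0)]))
```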


3 Experiment and Analysis 


3.1 Introduction to Data Sets 


To verify the feasibility and validity of the proposed method in a mobile network 
service environment, experiments were carried out on two data sets. One is the MIT 
data set, based on mobile communication behavior and provided by the Massachusetts 
Institute of Technology (MIT); the other is the publicly available movie rating data 
set Filmtipset. The disadvantage of the MIT data set is that both the number of mobile 
users and the number of mobile network services are relatively small. Therefore, the 
Filmtipset data set is used to further verify the validity of the method. Although the 
Filmtipset data set was not collected from a mobile network, it contains users' ratings 
of items as well as the users' social relationships, so it can verify the validity of the 
method to a certain extent. 


(1) The MIT data set [16] is a public data set provided by the Massachusetts Institute 
of Technology Media Lab. It covers 94 mobile users and records their behavior 
over a total of 9 months, from September 2004 to June 2005, including their use of 
mobile communication and mobile Internet services, mobile context information, 
and the users' friend relationships. 

(2) Filmtipset [17] is Sweden's largest film recommendation community, with more 
than 80000 registered users and more than twenty million movie ratings. The 
Filmtipset community provides not only rating information based on user records, 
but also the social relations between users. In this paper, we randomly selected 
500 users and their ratings of 1000 films for the experiment. 


Since the MIT data set does not include users' ratings of mobile network services, 
it is processed as follows before the experiment: using the mobile service usage 
behavior of all users over the nine months, each user's preferences for the mobile 
services are sorted in order of time, and the preference information is quantized as 
scores from 1 to 10. 

During the experiment, the two data sets are processed in the same way: each data 
set is randomly divided into five groups, and each group is split into a training set, 
which accounts for 80% of the group's data, and a test set, which accounts for the 
remaining 20%. 
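
The grouping and splitting procedure can be sketched as below. This is only one plausible reading: the function name, the fixed random seed and the round-robin assignment of shuffled records to groups are assumptions, and the paper does not state how the records inside a group are ordered before the 80/20 cut.

```python
import random


def make_groups(records, n_groups=5, train_ratio=0.8, seed=0):
    """Randomly split records into groups, then each group into
    (training set, test set) with the given training ratio."""
    rng = random.Random(seed)          # fixed seed for repeatability
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Deal the shuffled records into n_groups roughly equal groups.
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    splits = []
    for group in groups:
        cut = int(len(group) * train_ratio)
        splits.append((group[:cut], group[cut:]))
    return splits


if __name__ == "__main__":
    for train, test in make_groups(range(100)):
        print(len(train), len(test))
```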


3.2 Evaluation Standard 


In this paper, Mean Absolute Error (MAE) [3] and the F measure [18] are used as the 
evaluation criteria. 

The MAE measures the accuracy of the prediction by calculating the deviation 
between the predicted user score and the actual user score. The smaller the MAE value, 
the higher the accuracy of the prediction. Assume that the predicted score set is 
$\{p_1, p_2, \ldots, p_n\}$ and the corresponding actual score set is $\{q_1, q_2, \ldots, q_n\}$; 
MAE is then defined as 


$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |p_i - q_i| \qquad (9)$$


The F measure is a comprehensive assessment of precision and recall: the higher the 
F measure, the higher the accuracy. In this paper, precision and recall are redefined 
for the experimental evaluation as follows: 


P(u) = (number of trusted users of u) / (number of contacts of u) (10) 
R(u) = (number of trusted users of u) / (number of good friends of u) (11) 

where $u \in U$. The F measure is 


$$F(u) = \frac{2 \times P(u) \times R(u)}{P(u) + R(u)} \qquad (12)$$
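
Both evaluation criteria are short enough to sketch directly. The function and parameter names below are illustrative, and returning 0 when precision and recall are both 0 is an assumed convention for the undefined case.

```python
def mae(predicted, actual):
    """Mean absolute error between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)


def f_measure(n_trusted, n_contacts, n_friends):
    """F measure from the redefined precision and recall.

    n_trusted: number of users identified as trusted by user u.
    n_contacts: number of contacts of u (precision denominator).
    n_friends: number of good friends of u (recall denominator).
    """
    precision = n_trusted / n_contacts
    recall = n_trusted / n_friends
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(mae([4, 3, 5], [5, 3, 3]))
    print(f_measure(6, 10, 8))
```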


3.3 Analysis of Experimental Results 


(1) Comparison of the accuracy of trust and trust threshold selection 


This experiment is carried out on the MIT data set and compares the accuracy of the 
trust degree calculated with a logarithmic function against that calculated with a 
linear function. The range of the indirect trust attenuation factor D is set to 
[0.5, 1.0) with a step size of 0.05; according to the experimental results, D is set to 
0.55. The range of the trust threshold is set to [0.1, 0.9] with a step size of 0.1. 
The experimental results are shown in Fig. 1. As Fig. 1 shows, the trust calculation 
method based on the logarithmic function is superior to the one based on the linear 
function. This is because the relationship between user preference and traffic follows 
the theory of diminishing marginal utility, which the logarithmic function reflects 
better. 

Fig. 1 also shows that the F measure generally increases while the threshold is below 
0.6 and generally decreases once the threshold exceeds 0.6. When the threshold is 
small, too many users are treated as trusted; when it is large, too many trusted users 
are removed; both reduce the accuracy of the trust degree. The results show that the 
F measure is largest when the trust threshold is 0.6, so in this paper the trust 
threshold is set to 0.6. 


(2) Harmonic weighting parameter selection experiment 


This paper uses the genetic algorithm (GA) toolbox in Matlab to tune the parameters 
$\alpha_1$, $\beta_1$, $\alpha_2$ and $\beta_2$ on each of the two data sets. The fitness 
function of the genetic algorithm is defined as follows: 


eure = (D 2) /n (13) 


qi 


In this paper, the ranges of the harmonic weights $\alpha_1$ and $\alpha_2$ in Eq. (7) are 
set to $\alpha_1 \in (0,1)$ and $\alpha_2 \in (0,1)$, while $\beta_1 = 1 - \alpha_1$ and 
$\beta_2 = 1 - \alpha_2$. The optimal parameter values obtained by genetic algorithm 
optimization are shown in Table 1. 


Table 1. Tuning parameter values 


Parameter name | MIT data set | Filmtipset data set 

$\alpha_1$ 0.38 
$\beta_1$ 0.62 
$\alpha_2$ 0.67 
$\beta_2$ 0.33 


The experimental results show that, when determining the approximate neighbors of a 
mobile user, trust plays the major role if the trust is greater than the given 
threshold, and rating similarity plays the major role if the trust is less than the 
threshold. This is because people place more trust in those they are more familiar 
with, so their preferences are more likely to be affected by the people around them 
(such as family and friends), whereas among relatively unfamiliar people their 
preferences are influenced mainly by users whose ratings are similar to their own. 
When the trust is greater than the given threshold, the mobile users are more likely 
to be family members or friends; when it is less than the threshold, they are more 
likely to be strangers [19, 20]. 


(3) Preference prediction experiment based on public data set 


In this study, the following three methods were used to predict user preferences on 
the two data sets: Traditional Collaborative Filtering (TCF); Preference Prediction 
based on User Trust (PPUT), i.e., using only steps (1) and (3) of Sect. 2.5, without 
link prediction; and Preference Prediction based on User Trust and Link Prediction 
(PPUTLP), the method proposed in this paper. The experimental results are shown in 
Figs. 2 and 3; according to a number of experimental results, t = 25 is selected. 

As can be seen from Figs. 2 and 3: 


(1) For the MIT data set, the PPUTLP method has the highest accuracy when t = 15; 
for the Filmtipset data set, it has the highest accuracy when t = 25. When the 
value of t is too small, too much user data is discarded; when it is too large, too 
much noise data is included. In both cases the accuracy of prediction is reduced. 
In addition, the appropriate proportion of approximate neighbors differs between 
data sets, so the size of the approximate neighbor set must be chosen according to 
the specific characteristics of the experimental data set. 

(2) On both data sets, the PPUTLP method has a lower MAE value than the TCF and 
PPUT methods for every number of neighbors tested. This is because a user's 
preference is related not only to rating similarity, but is also influenced by the 
surrounding people (such as family and friends). The TCF method does not consider 
the trust relationship between users when determining a user's approximate 
neighbors; it considers only the similarity of the ratings. The PPUT method 
considers both the effect of rating similarity and the influence of surrounding 
people on user preferences, so its prediction accuracy is better than that of the 
TCF method. The PPUTLP method builds on the PPUT method by first using link 
prediction to select the items the user is likely to use, which eases the sparsity 
problem to a certain extent, so the PPUTLP method is better than the PPUT method. 

(3) On the MIT data set, the PPUTLP method improves the MAE by an average of 8.4% 
over the TCF method and by an average of 3.12% over the PPUT method [21]. On the 
Filmtipset data set, the PPUTLP method improves the MAE by an average of 4.03% 
over the TCF method and by an average of 1.94% over the PPUT method. The 
improvement in prediction accuracy is generally larger on the MIT data set than on 
the Filmtipset data set. This is because of the characteristics of mobile networks: 
social relations derived from the communication behavior of mobile users are more 
reliable than those obtained from the Internet, so the resulting trust relationships 
between users are more credible and the predicted user preferences more accurate. 


4 Conclusion 


Based on the characteristics of mobile communication networks, this paper proposes a 
mobile user preference prediction method based on trust and link prediction. The trust 
between mobile users is calculated by analyzing the potential social relations in 
mobile communication networks; the relevance between mobile users and mobile network 
services is explored using link prediction; and finally, mobile user preferences are 
predicted using the rating similarity between mobile users. The experimental results 
verify the feasibility of the proposed method for mobile user preference prediction. 
Because the method obtains more accurate mobile user preferences, it can provide 
mobile users with accurate, personalized mobile network services. 

This study does not consider the privacy and security of mobile users. In follow-up 
work, we will consider how to accurately predict the preferences of mobile users while 
protecting their privacy. 


Abstract. A debate model based on evidence theory is proposed to solve the 
problem of group decision-making under uncertain conditions. First, the 
framework of the debate system is constructed. The internal structure of an 
argument is composed of premises and a conclusion. Arguments can attack or 
support one another, and an argument can also support or oppose such an attack 
or support relationship. Then evidence theory is introduced to describe the 
uncertainty of the arguments, the evidence mapping method is applied to 
uncertain reasoning in the debate process, and the reliability of each argument 
is calculated numerically. Finally, a simulation example is given to demonstrate 
the effectiveness of the model. 


Keywords: Evidence theory · Uncertainty · Argumentation model · Evidence mapping 


1 Introduction 


In a dynamic environment, the information obtained by decision makers is usually 
incomplete and inconsistent, so conflict and disagreement among them are inevitable. 
Debate is an effective way to resolve such conflict [1, 2] and has become a hot topic 
in artificial intelligence research. The abstract debate framework proposed by Dung [3] 
is an influential debate model, and many scholars have since extended it, for example 
with the rule-based debate framework [4], the bipolar debate framework [5], and the 
extended bipolar debate framework [6]. 

The above models are mainly applied to debate reasoning under certainty, whereas the 
actual debate process involves uncertainty. In recent years, scholars at home and 
abroad have studied debate models under uncertainty, such as the preference-based 
debate framework proposed by Amgoud and Vesic [7], the debate model based on 
Dempster-Shafer theory proposed by Tang et al. [8], and the credibility-based and 
evidence-theory-based debate models of Xiong et al. [9, 10]. These models describe the 
uncertainty of the debate in different ways, but they are not comprehensive enough in 
constructing the relationships between arguments in the debate system. 

This paper follows the model of [11] in describing the internal structure of an 
argument and extends the description of the relationships between arguments: arguments 
can attack or support one another, and an argument can also support or oppose such an 
attack or support relationship. Evidence theory is used to analyze the reasoning 
process under uncertainty, and a quantitative description of the reliability of the 
arguments is obtained. 


2 Group Discussion Framework 


Group discussion is carried out through dialogue among speakers, and the arguments 
(views) generated during the dialogue constitute the content of the whole discussion. 
The discussion framework is a formal description of the arguments, including the 
internal structure of each argument and the interrelationships between arguments. 
Based on [11], this paper proposes an extended discussion model, which includes not 
only the attack and support relationships between arguments, but also support for or 
opposition to such attack and support relationships. 


Definition 1. [11] A statement is a description of things; it can be objective data, a 
subjective judgment, a factual phenomenon, theoretical knowledge, etc. Statements are 
the basic constituent units of an argument and are denoted $h$. 


Definition 2. [11] A reasoning argument is a tuple $A_l = (H, h)$, where 
$H = \{h_1, h_2, \ldots, h_n\}$ is a set of statements satisfying: (1) $H$ is consistent, 
namely $\nexists\, h_i, h_j \in H$ such that $h_i \equiv \neg h_j$; (2) logically, $H$ infers 
$h$, denoted $H \vdash h$. $H$ is called the premise of the reasoning argument, $h$ is 
called the conclusion of the reasoning argument, and the set of all reasoning arguments 
is denoted $\mathcal{A}_l$. 


Definition 3. [11] Given two reasoning arguments $A_l = (H_1, h_1)$ and 
$B_l = (H_2, h_2)$: (1) if $\exists h \in H_2$ with $h_1 \equiv \neg h$, then $A_l$ attacks 
$B_l$, denoted $A_l \nrightarrow B_l$ or $(A_l, B_l)^{-}$; (2) if $\exists h \in H_2$ with 
$h_1 \equiv h$, then $A_l$ supports $B_l$, denoted $A_l \rightarrow B_l$ or $(A_l, B_l)^{+}$. 
The attack and support relationships are collectively called argument-argument 
relations, denoted $r_{ll} = (A_l, B_l)$, as shown in Fig. 1. The set of all 
argument-argument relations is denoted $R_{ll}$. 


Definition 4. A rule argument is a two-tuple $A_g = (H, r_{ll})$, where $H$ is a set of 
statements and $r_{ll}$ is an argument-argument relation, satisfying: (1) $H$ is 
consistent; (2) logically, $H$ infers $r_{ll}$, denoted $H \vdash r_{ll}$. The set of all 
rule arguments is denoted $\mathcal{A}_g$. 


Fig. 1. Diagram of relations between reasoning arguments 


Definition 5. Given a rule argument $C_g = (H, r_1)$ and an argument-argument relation 
$r_2 = (A_l, B_l)$: (1) if $r_1 \equiv \neg r_2$, then $C_g$ opposes $r_2$, denoted 
$C_g \nrightarrow r_2$ or $(C_g, r_2)^{-}$; (2) if $r_1 \equiv r_2$, then $C_g$ supports 
$r_2$, denoted $C_g \rightarrow r_2$ or $(C_g, r_2)^{+}$. The support and opposition of an 
argument to an argument-argument relation is called an argument-rule relation, denoted 
$r_{lg} = (C_g, r_2)$. The set of all argument-rule relations is denoted $R_{lg}$. The 
four types of argument-rule relations are shown in Fig. 2. 

Figure 2(a) indicates that $C_g$ opposes $A_l$'s attack on $B_l$; Fig. 2(b) indicates that 
$C_g$ supports $A_l$'s attack on $B_l$; Fig. 2(c) indicates that $C_g$ opposes $A_l$'s 
support for $B_l$; and Fig. 2(d) indicates that $C_g$ supports $A_l$'s support for $B_l$. 


Definition 6. The debate-based discussion framework is a two-tuple $AG = (A, R)$, 
where $A = \mathcal{A}_l \cup \mathcal{A}_g$ is the argument set and $R = R_{ll} \cup R_{lg}$ 
is the relation set. 
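
The objects introduced in Definitions 2 to 6 can be sketched as plain Python data classes. Everything here is an illustrative assumption rather than the authors' implementation: the class and function names are hypothetical, statements are modelled as strings, and negation is encoded with a leading "~".

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ReasoningArgument:
    premise: FrozenSet[str]   # set of statements H
    conclusion: str           # statement h


@dataclass(frozen=True)
class ArgRelation:            # argument-argument relation
    source: ReasoningArgument
    target: ReasoningArgument
    kind: str                 # "attack" or "support"


@dataclass(frozen=True)
class RuleArgument:
    premise: FrozenSet[str]   # set of statements H
    conclusion: ArgRelation   # the relation it argues about


@dataclass(frozen=True)
class RuleRelation:           # argument-rule relation
    source: RuleArgument
    target: ArgRelation
    kind: str                 # "oppose" or "support"


def negate(statement: str) -> str:
    """Toggle the assumed '~' negation marker."""
    return statement[1:] if statement.startswith("~") else "~" + statement


def relation_between(a: ReasoningArgument,
                     b: ReasoningArgument) -> Optional[ArgRelation]:
    """Attack if a's conclusion negates a premise statement of b,
    support if it equals one, otherwise no relation."""
    if negate(a.conclusion) in b.premise:
        return ArgRelation(a, b, "attack")
    if a.conclusion in b.premise:
        return ArgRelation(a, b, "support")
    return None
```

A framework in the sense of Definition 6 is then simply a pair of collections: one holding `ReasoningArgument` and `RuleArgument` objects, the other holding `ArgRelation` and `RuleRelation` objects.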







Fig. 2. Diagram of relations between arguments and rules 


3 Debate Model Based on Evidence Theory 


In a dynamic environment, arguments and the relationships between them are uncertain. 
Evidence theory is an effective method for expressing and handling uncertain 
reasoning. This paper analyzes the arguments and relationships of the discussion model 
on the basis of evidence theory. 


3.1 The Uncertainty Representation of the Argument Premise 


Definition 7. [12] For a question to be judged, the complete set of all possible answers 
is $\Theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$, and the elements of $\Theta$ are mutually 
exclusive; $\Theta$ is then called the frame of discernment (recognition frame). The 
power set of $\Theta$ is denoted by $2^{\Theta}$. 


Definition 8. [13] For any proposition $A \in 2^{\Theta}$, a mapping $m : 2^{\Theta} \to [0, 1]$ 
that satisfies the following conditions: 
(1) $m(\emptyset) = 0$ 
(2) $\sum_{A \subseteq \Theta} m(A) = 1$ 
is called a basic belief assignment (BBA) on $\Theta$. 


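Definition 9. [14] Given two independent pieces of evidence $m_1$ and $m_2$ under the 
frame $\Theta$, their combination $m = m_1 \oplus m_2$ according to the Dempster combination 
rule is defined as 


$$m(A) = \begin{cases} \dfrac{1}{1-K} \sum_{A_i \cap B_j = A} m_1(A_i)\, m_2(B_j), & A \neq \emptyset \\ 0, & A = \emptyset \end{cases} \qquad (1)$$


where $K = \sum_{A_i \cap B_j = \emptyset} m_1(A_i)\, m_2(B_j)$ is the degree of conflict between 
the evidence $m_1$ and $m_2$. 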
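
For the two-element frame {True, False} used in the rest of the paper, a belief assignment reduces to a triple (mass on True, mass on False, mass on the whole frame), and the Dempster combination rule can be sketched in a few lines. The function name and the tuple layout are assumptions. The first input below is the initial reliability of statement h11 in the paper's example; the second input, (0.63, 0, 0.37), is our own derivation of the evidence that argument A2 contributes there.

```python
def dempster(w1, w2):
    """Combine two belief triples (true, false, unknown) on the
    frame {True, False} with the Dempster combination rule."""
    a1, b1, g1 = w1
    a2, b2, g2 = w2
    # Conflict: one source says True while the other says False.
    k = a1 * b2 + b1 * a2
    if k >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    norm = 1.0 - k
    true_mass = (a1 * a2 + a1 * g2 + g1 * a2) / norm
    false_mass = (b1 * b2 + b1 * g2 + g1 * b2) / norm
    unknown_mass = (g1 * g2) / norm
    return (true_mass, false_mass, unknown_mass)


if __name__ == "__main__":
    combined = dempster((0.7, 0.2, 0.1), (0.63, 0.0, 0.37))
    print([round(x, 4) for x in combined])  # [0.873, 0.0847, 0.0423]
```

Rounded to four places, the result agrees with the updated reliability of h11, (0.8730, 0.0847, 0.0423), reported in the paper's simulation example.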


Definition 10. Taking the truth of a statement $h$ as the question to be judged, the 
statement recognition frame is constructed as $\Theta_h = \{True, False\}$, where True 
represents that the statement is true and False represents that the statement is false. 


Definition 11. For the statement recognition frame $\Theta_h = \{True, False\}$, the 
reliability assignment function $m_h : 2^{\Theta_h} \to [0,1]$ is defined on $2^{\Theta_h}$, where 
$m_h(True) = \alpha_h$ is the reliability that the statement is true according to the 
current situation, $m_h(False) = \beta_h$ is the reliability that the statement is false, 
and $m_h(\Theta_h) = \gamma_h = 1 - \alpha_h - \beta_h$ is the reliability that the truth of the 
statement is unknown. The reliability vector of statement $h$ is denoted 
$w(h) = (\alpha_h, \beta_h, \gamma_h)$. 


Definition 12. Taking the truth of a premise $H = \{h_1, h_2, \ldots, h_n\}$ as the question 
to be judged, the premise recognition frame is constructed as 
$\Theta_H = \{True, False\}$. 

The reliability vector of the premise $H$ is denoted $w(H) = (\alpha_H, \beta_H, \gamma_H)$. 
When $H$ is the conjunction of multiple statements, that is $H = h_1 \wedge \cdots \wedge h_n$, 
then $w(H) = w(h_i)$, where $i$ satisfies $\alpha_{h_i} = \min_{j \in \{1,\ldots,n\}} \alpha_{h_j}$; 
that is, the reliability vector of the premise is taken from the statement with the 
lowest reliability. When $H$ is the disjunction of multiple statements, that is 
$H = h_1 \vee \cdots \vee h_n$, then $w(H) = w(h_i)$, where $i$ satisfies 
$\alpha_{h_i} = \max_{j \in \{1,\ldots,n\}} \alpha_{h_j}$; that is, the reliability vector of the 
premise is determined by the statement with the highest reliability. 


3.2 The Reasoning of Uncertainty in Argument Conclusion 


3.2.1 Reasoning Rule Representation Based on Evidence Mapping 


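Definition 13. [10, 15] Evidence mapping refers to the mapping from the premise 
recognition frame $\Theta_H$ to the conclusion recognition frame $\Theta_h$, which describes 
the reasoning relationship between the elements of $\Theta_H$ and the elements of 
$\Theta_h$. The evidence mapping from $\Theta_H$ to $\Theta_h$ is expressed as 
$\Gamma : \Theta_H \to 2^{2^{\Theta_h} \times [0,1]}$. The evidence mapping for each element of 
$\Theta_H$ is expressed as 


$$\Gamma(\theta_{H_i}) = \{(\Theta_{h_{i1}}, f(\theta_{H_i} \to \Theta_{h_{i1}})), \ldots, (\Theta_{h_{im}}, f(\theta_{H_i} \to \Theta_{h_{im}}))\} \qquad (2)$$


where $\theta_{H_i}$ is a premise element, $\Theta_{h_{ij}}$ is a conclusion element, and 
$f(\theta_{H_i} \to \Theta_{h_{ij}})$ is the rule strength with which $\theta_{H_i}$ supports 
$\Theta_{h_{ij}}$. Let $\Theta_{H_i} = \bigcup_j \Theta_{h_{ij}}$ be the collection of all conclusions 
inferred from the premise element $\theta_{H_i}$. 
Equation (2) satisfies the following conditions: 


(2) $0 < f(\theta_{H_i} \to \Theta_{h_{ij}}) \le 1$ 


(3) $\sum_j f(\theta_{H_i} \to \Theta_{h_{ij}}) = 1$ 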


3.2.2 The Calculation of the Conclusion of Reasoning 

Under the condition of uncertainty, for the reasoning argument $A_l = (H, h)$, the rule 
relations between the premise $H$ and the conclusion $h$ include the following: 
$H \to h$, $H \to \neg h$, $\neg H \to h$, $\neg H \to \neg h$. The corresponding rule strengths are 
$Hh = f(H \to h)$, $H\neg h = f(H \to \neg h)$, $\neg Hh = f(\neg H \to h)$ and 
$\neg H\neg h = f(\neg H \to \neg h)$. 


Definition 14. [10] For the reasoning argument $A_l = (H, h)$, we define the mapping 
matrix $R(A_l) = (r_{ij})_{3 \times 3}$ from the premise $H$ to the conclusion $h$. The row 
headings are $H$, $\neg H$ and the complete set $\Theta_H$; the column headings are $h$, 
$\neg h$ and the complete set $\Theta_h$. The specific form of $R(A_l)$ is 


$$R(A_l) = \begin{pmatrix} Hh & H\neg h & 1 - r_{11} - r_{12} \\ \neg Hh & \neg H\neg h & 1 - r_{21} - r_{22} \\ r_{31} & r_{32} & 1 - r_{31} - r_{32} \end{pmatrix}$$


among which, 


$$r_{31} = \begin{cases} (r_{11} + r_{21})/2, & r_{11} \times r_{21} \neq 0 \\ 0, & r_{11} \times r_{21} = 0 \end{cases} \qquad (3)$$

$$r_{32} = \begin{cases} (r_{12} + r_{22})/2, & r_{12} \times r_{22} \neq 0 \\ 0, & r_{12} \times r_{22} = 0 \end{cases} \qquad (4)$$


According to the reliability vector $w(H)$ of the premise $H$ and the mapping matrix 
$R(A_l)$, the reliability vector 


$$w(h) = w(H) \cdot R(A_l) \qquad (5)$$


of the conclusion h can be obtained. 
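
The premise rule and the mapping-matrix step can be sketched together in Python. The function names and keyword arguments are assumptions; rule strengths that are not given are taken to be 0, which is how we read the paper's example. The values used are those of argument A1 in the simulation example: premise statements with reliability (0.7, 0.2, 0.1) and (0.85, 0.1, 0.05), and rule strengths 0.9 for H1 → h and 0.8 for ¬H1 → ¬h.

```python
def premise_and(vectors):
    """Conjunctive premise: take the vector with the lowest true-mass."""
    return min(vectors, key=lambda w: w[0])


def premise_or(vectors):
    """Disjunctive premise: take the vector with the highest true-mass."""
    return max(vectors, key=lambda w: w[0])


def mapping_matrix(Hh=0.0, Hnh=0.0, nHh=0.0, nHnh=0.0):
    """Build the 3x3 mapping matrix from the four rule strengths."""
    r11, r12, r21, r22 = Hh, Hnh, nHh, nHnh
    # Third row: average a column only if both entries are non-zero.
    r31 = (r11 + r21) / 2 if r11 * r21 != 0 else 0.0
    r32 = (r12 + r22) / 2 if r12 * r22 != 0 else 0.0
    return [
        [r11, r12, 1 - r11 - r12],
        [r21, r22, 1 - r21 - r22],
        [r31, r32, 1 - r31 - r32],
    ]


def propagate(w_premise, matrix):
    """Row vector times matrix: reliability of the conclusion."""
    return tuple(
        sum(w_premise[i] * matrix[i][j] for i in range(3)) for j in range(3)
    )


if __name__ == "__main__":
    w_H1 = premise_and([(0.7, 0.2, 0.1), (0.85, 0.1, 0.05)])
    R11 = mapping_matrix(Hh=0.9, nHnh=0.8)
    print([round(x, 4) for x in propagate(w_H1, R11)])  # [0.63, 0.16, 0.21]
```

The printed vector matches the value (0.63, 0.16, 0.21) that the paper reports for the conclusion h at time node 1.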

Assuming that there are multiple premises pointing to the same conclusion $h$, 
we can fuse the multiple reliability values of $h$ according to the Dempster combination 
rule to get $w(h) = w(h)_1 \oplus w(h)_2 \oplus \cdots \oplus w(h)_n$. 








3.2.3 The Calculation of the Conclusions of the Rules 

For the rule argument $A_g = (H, r_{ll})$, where $r_{ll} = (A_l, B_l)$, $A_l = (H_1, h_1)$ and 
$B_l = (H_2, h_2)$, the argument-argument relation $r_{ll}$ corresponds in essence to the 
mapping matrix $R(A_l)$ of $A_l$. Therefore, the relationship between the premise $H$ and 
the conclusion $r_{ll}$ is the relationship between $H$ and $R(A_l)$, which includes rules 
of the form $H \to f(H_1 \to h_1)$. The corresponding rule strengths are recorded as 
$H(H_1h_1)$, $H(H_1\neg h_1)$, $\neg H(H_1h_1)$, $\neg H(H_1\neg h_1)$, $H(\neg H_1h_1)$, 
$H(\neg H_1\neg h_1)$, $\neg H(\neg H_1h_1)$ and $\neg H(\neg H_1\neg h_1)$. 

Because the relationship between the premise $H$ and the conclusion $r_{ll}$ is more 
complex, the rule strengths are divided into two groups (the first four and the last 
four), which are calculated separately and then synthesized. 


Definition 15. For the rule argument $A_g = (H, r_{ll})$, the mapping matrix from the 
premise $H$ to the first row $R(A_l)^{(1)}$ of $R(A_l)$ is defined as 
$R_1 = (r_{ij})_{3 \times 3}$. The row headings are $H$, $\neg H$ and the complete set $\Theta_H$; 
the column headings are the rule strengths deduced from $H_1$, namely $H_1h_1$, 
$H_1\neg h_1$ and the complete set $\Theta_{h_1}$. The specific form of $R_1$ is 


$$R_1 = \begin{pmatrix} H(H_1h_1) & H(H_1\neg h_1) & 1 - r_{11} - r_{12} \\ \neg H(H_1h_1) & \neg H(H_1\neg h_1) & 1 - r_{21} - r_{22} \\ r_{31} & r_{32} & 1 - r_{31} - r_{32} \end{pmatrix}$$


where $r_{31}$ and $r_{32}$ are calculated by formulas (3) and (4). 
According to the reliability vector $w(H)$ of the premise $H$ and the mapping matrix 
$R_1$, the first row 


$$R(A_l)^{(1)} = w(H) \cdot R_1 \qquad (6)$$


of the mapping matrix $R(A_l)$ can be obtained. 


Definition 16. Similarly, the mapping matrix from the premise $H$ to the second row 
$R(A_l)^{(2)}$ of $R(A_l)$ is defined as $R_2 = (r_{ij})_{3 \times 3}$. The row headings are $H$, 
$\neg H$ and the complete set $\Theta_H$; the column headings are the rule strengths deduced 
from $\neg H_1$, namely $\neg H_1h_1$, $\neg H_1\neg h_1$ and the complete set $\Theta_{h_1}$. The 
specific form of $R_2$ is 


$$R_2 = \begin{pmatrix} H(\neg H_1h_1) & H(\neg H_1\neg h_1) & 1 - r_{11} - r_{12} \\ \neg H(\neg H_1h_1) & \neg H(\neg H_1\neg h_1) & 1 - r_{21} - r_{22} \\ r_{31} & r_{32} & 1 - r_{31} - r_{32} \end{pmatrix}$$


where $r_{31}$ and $r_{32}$ are calculated by formulas (3) and (4). The second row of the 
mapping matrix $R(A_l)$ can then be obtained as 


$$R(A_l)^{(2)} = w(H) \cdot R_2 \qquad (7)$$


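Thus, under the influence of $H$, the original mapping matrix $R(A_l)$ is updated as 
follows: 


(1) The first row of $R(A_l)$ is updated according to the Dempster combination rule: 
$R(A_l)^{(1)}_{new} = R(A_l)^{(1)}_{0} \oplus R(A_l)^{(1)}$, where $R(A_l)^{(1)}_{0}$ is the 
first row before the update and $R(A_l)^{(1)}$ is given by Eq. (6). 


(2) Similarly, the second row of $R(A_l)$ is updated as 
$R(A_l)^{(2)}_{new} = R(A_l)^{(2)}_{0} \oplus R(A_l)^{(2)}$. 


(3) The third row of $R(A_l)$ is calculated by formulas (3) and (4). 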
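
The row update can be sketched as follows, under our reading that the new row is the Dempster combination of the existing row with the row vector obtained from the rule argument's premise. The helper names are hypothetical. The numbers come from time node 5 of the paper's example: the first row of the mapping matrix of A4 is (0, 0.8, 0.2) before the update, the premise of A5 has reliability (0.8, 0.2, 0), and the only stated rule strength of A5 is 0.5.

```python
def dempster(w1, w2):
    """Dempster combination of two (true, false, unknown) triples."""
    a1, b1, g1 = w1
    a2, b2, g2 = w2
    k = a1 * b2 + b1 * a2          # conflicting mass
    if k >= 1.0:
        raise ValueError("total conflict")
    n = 1.0 - k
    return ((a1 * a2 + a1 * g2 + g1 * a2) / n,
            (b1 * b2 + b1 * g2 + g1 * b2) / n,
            (g1 * g2) / n)


def propagate(w_premise, matrix):
    """Row vector times 3x3 matrix."""
    return tuple(
        sum(w_premise[i] * matrix[i][j] for i in range(3)) for j in range(3)
    )


def update_row(old_row, w_rule_premise, rule_matrix):
    """Combine an existing matrix row with the evidence that a rule
    argument contributes for that row."""
    evidence = propagate(w_rule_premise, rule_matrix)
    return dempster(old_row, evidence)


if __name__ == "__main__":
    # Rule matrix of A5: only H5 -> (H4 -> h12) has strength 0.5;
    # unspecified strengths are assumed to be 0.
    R1 = [[0.5, 0.0, 0.5],
          [0.0, 0.0, 1.0],
          [0.0, 0.0, 1.0]]
    new_row = update_row((0.0, 0.8, 0.2), (0.8, 0.2, 0.0), R1)
    print([round(x, 4) for x in new_row])  # [0.1176, 0.7059, 0.1765]
```

The result agrees with the updated first row (0.1176, 0.7059, 0.1765) given in the paper's example, which supports this reading of the update rule.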





3.3 Updating the Reliability of Statements 


In the course of the discussion, when a new reasoning argument $A_2$ appears and there 
is an argument-argument relation $(A_2, A_1)$, $A_2$ changes the reliability value of a 
statement $h_{1i}$ in the premise $H_1$ of $A_1$, thus changing the reliability of the 
conclusion $h_1$ of $A_1$. As the discussion continues and the relation $(A_3, A_2)$ is 
generated, $A_3$ changes the reliability value of a statement $h_{2j}$ in the premise 
$H_2$ of $A_2$, thus changing the reliability value of the conclusion $h_2$ of $A_2$ (that 
is, $h_{1i}$), and finally the reliability of the conclusion $h_1$ of $A_1$. In addition, 
when a new rule argument $A_4$ is generated and there is an argument-rule relation 
$(A_4, (A_2, A_1))$, $A_4$ changes the mapping matrix $R_{21}$ of $A_2$, thus changing the 
reliability value of $A_2$'s conclusion $h_2$ (that is, $h_{1i}$), and eventually the 
reliability value of $A_1$'s conclusion $h_1$. Thus, for each new argument, the 
reliability values of the corresponding statements along the entire argument chain 
change. 


(1) When a reasoning argument is added, the reliability update algorithm within the 
discussion framework is as follows: 


Step 1: Generate a reasoning argument $A_l = (H_A, h_A)$, set the initial reliability 
value of each statement $h_{Ai}$ in the premise $H_A = \{h_{A1}, h_{A2}, \ldots, h_{An}\}$ to 
$w(h_{Ai}) = (\alpha_{h_{Ai}}, \beta_{h_{Ai}}, \gamma_{h_{Ai}})$, calculate the reliability value 
$w(H_A)$ of $H_A$, and generate the inference rule mapping matrix $R_A$ from $H_A$ to $h_A$. 

Step 2: Calculate the reliability value $w(h_A) = w(H_A) \cdot R_A$ of $h_A$. If $h_A$ is the 
target statement of the whole debate, its reliability value is updated. If $h_A$ 
belongs to the premise of an argument $B_l = (H_B, h_B)$, that is $h_A = h_{Bj} \in H_B$, 
then $h_A$ updates the reliability value of $h_{Bj}$ to 
$w(h_{Bj})_{new} = w(h_{Bj}) \oplus w(h_A)$, the reliability value $w(H_B)_{new}$ of $H_B$ is 
updated, and the reliability value $w(h_B) = w(H_B)_{new} \cdot R_B$ of $h_B$ is calculated. 

If $h_B$ is itself a premise statement of another argument, repeat Step 2 until the 
target statement of the whole debate is updated. 


(2) When a rule argument is added, the reliability update algorithm within the 
discussion framework is as follows: 


Step 1: Produce a rule argument $C_g = (H_C, r_C)$, where the premise is 
$H_C = \{h_{C1}, h_{C2}, \ldots, h_{Cn}\}$ and the relation is $r_C = (A_l, B_l)$. The initial 
reliability value of each statement $h_{Ci}$ is set to 
$w(h_{Ci}) = (\alpha_{h_{Ci}}, \beta_{h_{Ci}}, \gamma_{h_{Ci}})$, the reliability value $w(H_C)$ of 
$H_C$ is calculated, the initial mapping matrix of $A_l$ is $R(A_l)_0$, and the updated 
mapping matrix is $R(A_l)_{new}$. 

Step 2: Calculate the updated reliability value 
$w(h_A)_{new} = w(H_A) \cdot R(A_l)_{new}$ of $h_A$; $h_A$ then updates the reliability value 
of $h_{Bj}$ to $w(h_{Bj})_{new} = w(h_{Bj}) \oplus w(h_A)_{new}$, after which the reliability 
value of $h_B$ is updated. 


3.4 The Acceptability of the Statement 


If the reliability vector of statement $h$ is $w(h) = (\alpha_h, \beta_h, \gamma_h)$, whether it 
is acceptable can be determined according to the following three rules (Table 1): 

Rule 1: If $\alpha_h > \alpha_0$, statement $h$ is acceptable. 

Rule 2: If $\beta_h < \beta_0$, statement $h$ is acceptable. 

Table 1. Arguments of the discussion 


Argument | Argument premise | Argument conclusion | Initial reliability values of premise statements | Reasoning rule strength 
$A_1$ | $h_{11} \wedge h_{12}$ | $h$ | $w(h_{11}) = (0.7,0.2,0.1)$; $w(h_{12}) = (0.85,0.1,0.05)$ | $H_1 \to h$ (0.9); $\neg H_1 \to \neg h$ (0.8) 
$A_2$ | $h_{21} \wedge h_{22} \wedge h_{23}$ | $h_{11}$ | $w(h_{21}) = (0.8,0,0.2)$; $w(h_{22}) = (0.9,0.1,0)$; $w(h_{23}) = (0.7,0.1,0.2)$ | $H_2 \to h_{11}$ (0.9) 
$A_3$ | $h_{31} \wedge h_{32}$ | $h_{23}$ | $w(h_{31}) = (0.8,0.1,0.1)$; $w(h_{32}) = (0.7,0,0.3)$ | $H_3 \to \neg h_{23}$ (0.8) 
$A_4$ | $h_{41} \wedge (h_{42} \vee h_{43})$ | $h_{12}$ | $w(h_{41}) = (0.6,0.1,0.3)$; $w(h_{42}) = (0.8,0.2,0)$; $w(h_{43}) = (0.7,0.1,0.2)$ | $H_4 \to \neg h_{12}$ (0.8) 
$A_5$ | $h_{51} \wedge h_{52}$ | $r(A_4, A_1)$ | $w(h_{51}) = (0.9,0,0.1)$; $w(h_{52}) = (0.8,0.2,0)$ | $H_5 \to (H_4 \to h_{12})$ (0.5) 


Time node 1 produces argument $A_1$. The reliability value of the premise $H_1$ is 
$w(H_1)^1 = w(h_{11})$, the rule mapping matrix is 

$$R_{11} = \begin{pmatrix} 0.9 & 0 & 0.1 \\ 0 & 0.8 & 0.2 \\ 0 & 0 & 1 \end{pmatrix},$$

and the reliability value of the conclusion $h$ is 
$w(h)^1 = w(H_1)^1 \cdot R_{11} = (0.63, 0.16, 0.21)$. 

Time node 2 produces argument $A_2$. The reliability value of the premise $H_2$ is 
$w(H_2)^2 = w(h_{23})$, and the reliability value that $A_2$ contributes to its conclusion 
$h_{11}$ is $w(h_{11})' = w(H_2)^2 \cdot R_{21}$. The reliability value of $h_{11}$ is updated to 
$w(h_{11})^2_{new} = w(h_{11}) \oplus w(h_{11})' = (0.8730, 0.0847, 0.0423)$. In this case, the 
reliability value of $H_1$ is $w(H_1)^2 = w(h_{12})$, and the reliability value of $h$ is 
updated to $w(h)^2 = w(H_1)^2 \cdot R_{11} = (0.7650, 0.0800, 0.1550)$. It can be seen that 
the reliability of $A_1$'s conclusion $h$ is improved by the support of $A_2$ for $A_1$. 

Time node 3 produces argument $A_3$. $A_3$ reduces the reliability of the premise of 
$A_2$ and thus reduces the reliability of $h_{11}$; the reliability value of $h$ is finally 
updated to $w(h)^3 = (0.7267, 0.1027, 0.1706)$. 

Time node 4 produces argument $A_4$, giving 
$w(h_{12})^4_{new} = w(h_{12}) \oplus w(h_{12})' = (0.7466, 0.2095, 0.0439)$. The reliability of 
$h_{12}$ is reduced by $A_4$, and the reliability value of $h$ is updated to 
$w(h)^4 = (0.6720, 0.1676, 0.1605)$. 

Time node 5 produces argument $A_5$, which opposes the attack of $A_4$ on $A_1$. The 
first row of the mapping matrix $R_{41}$ of $A_4$ is updated to 
$R^{(1)}_{new} = R^{(1)}_{0} \oplus R^{(1)} = (0.1176, 0.7059, 0.1765)$. Thus the reliability 
value of $h$ is updated to $w(h)^5 = (0.7018, 0.1442, 0.1540)$. 

Assuming that the acceptability of the statement is judged by Rule 1, if the 
threshold is $\alpha_0 = 0.7$, then $h$ is acceptable at the end of the discussion. 

The update process of the statement reliability values over the whole discussion is 
shown in Table 2. 


Table 2. The renewal process of the reliability values of statements 


Statement | Time 1 | Time 2 | Time 3 | Time 4 | Time 5 
$h$ | (0.63,0.16,0.21) | (0.7650,0.0800,0.1550) | (0.7267,0.1027,0.1706) | (0.6720,0.1676,0.1605) | (0.7018,0.1442,0.1540) 
$h_{11}$ | (0.7,0.2,0.1) | (0.8730,0.0847,0.0423) | (0.8074,0.1284,0.0642) | (0.8074,0.1284,0.0642) | (0.8074,0.1284,0.0642) 
$h_{12}$ | (0.85,0.10,0.05) | (0.85,0.10,0.05) | (0.85,0.10,0.05) | (0.7466,0.2095,0.0439) | (0.7797,0.1803,0.0400) 
$h_{21}$ | - | (0.80,0,0.20) | (0.80,0,0.20) | (0.80,0,0.20) | (0.80,0,0.20) 
$h_{22}$ | - | (0.90,0.10,0) | (0.90,0.10,0) | (0.90,0.10,0) | (0.90,0.10,0) 
$h_{23}$ | - | (0.7,0.1,0.2) | (0.4565,0.4130,0.1304) | (0.4565,0.4130,0.1304) | (0.4565,0.4130,0.1304) 
$h_{31}$ | - | - | (0.8,0.1,0.1) | (0.8,0.1,0.1) | (0.8,0.1,0.1) 
$h_{32}$ | - | - | (0.7,0,0.3) | (0.7,0,0.3) | (0.7,0,0.3) 
$h_{41}$ | - | - | - | (0.6,0.1,0.3) | (0.6,0.1,0.3) 
$h_{42}$ | - | - | - | (0.8,0.2,0) | (0.8,0.2,0) 
$h_{43}$ | - | - | - | (0.7,0.1,0.2) | (0.7,0.1,0.2) 
$h_{51}$ | - | - | - | - | (0.9,0,0.1) 
$h_{52}$ | - | - | - | - | (0.8,0.2,0) 


4 Conclusion 


In this paper, a debate model based on evidence theory is constructed for the problem
of group decision-making under uncertain conditions. First, a more comprehensive
framework of the debate system is defined. It specifies both the internal structure of an
argument and the relationships between arguments, which include not only attack and
support relationships but also support for or opposition to those relationships themselves.
Then evidence theory is introduced to describe the uncertainty of arguments, and the
evidence mapping method is applied to the debate process to realize the transmission and
renewal of the reliability of arguments.
Abstract. Software knowledge plays an important role in software testing and in
software reliability models. Based on a theoretical analysis of the Weibull distribution
of defect density, this paper proposes that software knowledge significantly affects the
software reliability distribution, and shows that the amount of software knowledge mainly
acts through the scale parameter c of the Weibull distribution, so that c can be expressed
as a quantitative expression of the amount of software knowledge. An engineering
experiment is carried out to verify this conclusion; it shows that the more knowledge
testers have, the smaller the scale parameter c of the Weibull distribution becomes.
Furthermore, according to the degree of software knowledge, the trend of the problems
found in testing can be predicted, so as to evaluate the reliability of the software.


Keywords: Software knowledge · Software test · Reliability model · Weibull distribution


1 Introduction 


Software reliability is the probability that software runs successfully according to its
design requirements over a given time interval and under the required environmental
conditions [1-3]. When a software product is released, its reliability and latent faults
can be predicted and estimated through software reliability theory. This serves two
purposes: (1) it can be used as an objective statement of product quality; (2) it can be
used for resource planning in the software maintenance phase.
Software reliability models play an important role in software reliability theory and fall
into two types, static models and dynamic models. In a static model [4, 5], the coefficients
are estimated from previous data of software products, while related software products can
provide supplementary observations for the whole project. A static model does not consider
the time factor and is in fact a quality model. Dynamic models [6-8] are mainly used in
the software development phase and the software reliability testing phase. The Rayleigh
model is a typical model used in the development phase, while the exponential growth
model and other reliability growth models are used in the test phase. All dynamic
reliability models can be expressed as a function of time or of its logical equivalent,
such as the development phase.

Knowledge can provide guidance for software testing, and many studies have addressed
this. Xu [9] indicated that software errors are caused by errors in the knowledge used in
the software system and in how it is used, and proposed a knowledge-based software testing
method derived from practical analysis, which can effectively compensate for shortcomings
in the adequacy and suitability of testing methods. Vijaykumar [10] proposed the concept of
a Reference Ontology of Software Testing by identifying and reusing ontologies in the
software testing process, which gives semantics to a large amount of software testing
information and thereby supports knowledge management in the software testing process.

However, very little research addresses the effect of software knowledge on software
reliability. Scientific and quantitative analysis is needed to describe the relationship
between knowledge and software reliability, so that knowledge can be used to achieve
high-quality software testing and to improve software quality and reliability. Therefore,
this paper focuses on the relationship between software knowledge and software reliability.

The remainder of this paper is structured as follows. In Sect. 2, the Weibull distribution
is briefly summarized. In Sect. 3, the relationship between software knowledge and software
reliability is derived in detail, and a formal expression of the relationship is described
briefly. Engineering experiment results and analysis are reported in Sect. 4, and Sect. 5
concludes this paper.


2 Software Reliability Distribution


The Weibull distribution [11, 12] has been used for reliability analysis in various
engineering fields for many years, with applications ranging from the fatigue life of
deep-groove ball bearings to tube faults and river floods. It is one of the three
well-known extreme value distributions. A significant feature of the Weibull distribution
is that the tail of its probability density gradually tends to zero. Figure 1 shows Weibull
probability density curves for different values of the shape parameter m.

The Cumulative Distribution Function (CDF) and Probability Density Function (PDF) of the
Weibull distribution can be formulated as:

CDF: $F(t) = 1 - e^{-(t/c)^m}$ (1)

PDF: $f(t) = \frac{m}{t} \left(\frac{t}{c}\right)^m e^{-(t/c)^m}$ (2)

where m denotes the shape parameter, c denotes the scale parameter, and t denotes time.
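As a quick consistency check on Formulas (1) and (2), the following sketch evaluates both functions and confirms numerically that the PDF is the derivative of the CDF. The parameter values are arbitrary illustrations, not values from the paper.

```python
import math


def weibull_cdf(t, m, c):
    """Formula (1): F(t) = 1 - exp(-(t/c)^m)."""
    return 1.0 - math.exp(-((t / c) ** m))


def weibull_pdf(t, m, c):
    """Formula (2): f(t) = (m/t) * (t/c)^m * exp(-(t/c)^m), for t > 0."""
    return (m / t) * (t / c) ** m * math.exp(-((t / c) ** m))


def numeric_derivative(func, t, h=1e-6):
    """Central finite difference, used only to cross-check the PDF."""
    return (func(t + h) - func(t - h)) / (2.0 * h)


m, c, t = 2.0, 1.5, 1.0  # illustrative shape, scale and time
slope = numeric_derivative(lambda x: weibull_cdf(x, m, c), t)
print(abs(slope - weibull_pdf(t, m, c)) < 1e-6)
```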
The Rayleigh model is a special case of the Weibull distribution in which the shape
parameter m is equal to 2. Its CDF and PDF are expressed as:

CDF: $F(t) = 1 - e^{-(t/c)^2}$ (3)

PDF: $f(t) = \frac{2}{t} \left(\frac{t}{c}\right)^2 e^{-(t/c)^2}$ (4)

Fig. 1. Several typical curves of Weibull distribution


The exponential model is another special case of the Weibull distribution, in which the
shape parameter m is equal to 1. The exponential model is suitable for statistical processes
that decline monotonically to an asymptote. Its CDF and PDF are expressed as:

CDF: $F(t) = 1 - e^{-t/c} = 1 - e^{-\lambda t}$ (5)

PDF: $f(t) = \frac{1}{c} e^{-t/c} = \lambda e^{-\lambda t}$ (6)


In engineering practice, the above formulations need to be multiplied by the total 
defect number or total defect cumulative probability K, where K is an estimated param- 
eter for deriving specific model from the dataset. 
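A minimal sketch of this scaling for the exponential model, assuming the rate is $\lambda = 1/c$ as in Formulas (5) and (6). The function names and the values of K and c are hypothetical and chosen only for illustration.

```python
import math


def exponential_cdf(t, lam):
    """Formula (5): F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def expected_cumulative_defects(t, K, c):
    """K * F(t) for the exponential model, with rate lam = 1 / c."""
    return K * exponential_cdf(t, 1.0 / c)


def expected_defects_in_interval(t_start, t_end, K, c):
    """Defects expected between two points in time."""
    return (expected_cumulative_defects(t_end, K, c)
            - expected_cumulative_defects(t_start, K, c))


K, c = 100.0, 2.0  # hypothetical total defect number and scale parameter
print(round(expected_cumulative_defects(2.0, K, c), 2))
print(round(expected_defects_in_interval(2.0, 4.0, K, c), 2))
```

The cumulative curve approaches K as t grows, which is why K is read as the total defect number.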


3 Analysis of Knowledge and Reliability


3.1 Relationship Derivation 


Engineering practice shows that the experience and knowledge of a testing team affect
testing efficiency and the problems found. Generally, a high-level testing team finds
problems more quickly, and the problems found follow a certain mathematical distribution.

Two Weibull functions are commonly used in software reliability applications: the
Rayleigh distribution describes the defect distribution in the development phase, and the
exponential distribution describes the fault distribution in the system test phase or
after product delivery. Because engineering practice shows a significant correlation
between the knowledge of the testing team and testing efficiency, this paper analyzes the
quantitative relationship between knowledge and fault distribution using the Weibull function.
This paper takes the system testing phase as an example, where the shape parameter
m is equal to 1. There are three hypotheses, as follows.

Hypothesis 1: The knowledge quantities of testing teams A and B are $S_A$ and $S_B$, respectively.

Hypothesis 2: Testing teams A and B each test the same software product, and each can find
all defects through learning and experience accumulation.

Hypothesis 3: The fault distribution of the software product follows the Weibull
distribution used in the software reliability field.


Assume that the fault set X is the sum of all kinds of fault modes found at moment t:

$X = K \cdot f(t) = \frac{K}{c} e^{-t/c}$ (7)

where K denotes the total defect number and c denotes the scale parameter.

As observed from engineering practice, the experience and knowledge of the testing team
affect test efficiency and the problems found. Therefore, the knowledge quantity of the
testing team is positively related to the number of defects that can be found.

$K = K(S) \propto S$ (8)

Plugging Formula (8) into Formula (7) gives:

$X = \frac{K(S)}{c} e^{-t/c}$ (9)


which means that the defect number X found at moment t is related to the knowledge quantity
S and the scale factor c. X can be expressed in terms of S and c as:

$X = X(S, c, t)$ (10)

Assume that the fault probability distributions of testing teams A and B are $f_1(t, c_1)$ and
$f_2(t, c_2)$, respectively:

$X_1 = K(S_A) f_1(t) = \frac{K(S_A)}{c_1} e^{-t/c_1}$ (11)

$X_2 = K(S_B) f_2(t) = \frac{K(S_B)}{c_2} e^{-t/c_2}$ (12)


Taking logarithms of Formulas (11) and (12) and subtracting gives:

$\ln X_1 - \ln X_2 = \ln K(S_A) - \ln K(S_B) - \ln c_1 + \ln c_2 - t \cdot \frac{c_2 - c_1}{c_1 c_2}$ (13)

Formula (13) can be rewritten as:

$\Delta \ln X = \Delta \ln K - \Delta \ln c - t \cdot \frac{c_2 - c_1}{c_1 c_2}$ (14)
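The algebra behind Formula (13) can be verified numerically. The sketch below computes the log difference directly from Formulas (11) and (12) and compares it with the right-hand side of Formula (13); all parameter values are hypothetical.

```python
import math


def defects(K, c, t):
    """Formulas (11) and (12): X = (K / c) * exp(-t / c)."""
    return (K / c) * math.exp(-t / c)


def log_gap_formula13(K_a, K_b, c1, c2, t):
    """Right-hand side of Formula (13)."""
    return (math.log(K_a) - math.log(K_b)
            - math.log(c1) + math.log(c2)
            - t * (c2 - c1) / (c1 * c2))


# Hypothetical values: team A has K(S_A) and c1, team B has K(S_B) and c2.
K_a, K_b, c1, c2, t = 35.0, 28.0, 0.4, 0.9, 1.5
direct = math.log(defects(K_a, c1, t)) - math.log(defects(K_b, c2, t))
closed_form = log_gap_formula13(K_a, K_b, c1, c2, t)
print(abs(direct - closed_form) < 1e-12)
```

The last term follows from $-t/c_1 + t/c_2 = -t (c_2 - c_1)/(c_1 c_2)$.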


A high-level testing team finds problems faster and in greater number, which means that
if $S_A > S_B$, then $c_1 < c_2$. At any fixed moment t, team A finds more faults, and finds
them more efficiently, than team B, and team A reaches the testing goal sooner than team B.

Because $\Delta \ln K$ is a constant, the relationship between $\Delta \ln X$ and
$|\Delta \ln c|$ can be shown as in Fig. 2.





Fig. 2. Relationship between defect number X and scale parameter c (vertical axis: $\Delta \ln X$; horizontal axis: $|\Delta \ln c|$)


From Fig. 2, $\Delta \ln X$ is proportional to $|\Delta \ln c|$. Because $c_1 < c_2$, X is
inversely proportional to c. Moreover, Formula (13) indicates that X is proportional to the
knowledge quantity S that the testing team has. Therefore, the knowledge quantity S that the
testing team has is inversely proportional to c, and the knowledge quantity that the team
still needs to acquire is proportional to c.

Therefore, software knowledge mainly affects the scale parameter c of the Weibull
distribution. The smaller c is, the more knowledge testers have and the less knowledge they
need to obtain; the larger c is, the less knowledge testers have and the more knowledge they
need to obtain. Thus c can be expressed as a quantitative expression of the knowledge
quantity needed.


$c = A \, g(D, A^D, R, A^R, H, I)$ (15)

where D denotes the concept set of software knowledge, $A^D$ denotes the attribute set of
software knowledge, R denotes the relationships and rules of knowledge, $A^R$ denotes the
attribute set of knowledge relationships, H denotes the level of knowledge concepts, I
denotes the instance set, and A denotes a coefficient. The function g() denotes the measure
of the knowledge quantity needed in software development or testing, which is proportional
to the scale parameter c.
3.2 The Effect of Knowledge Analysis


This subsection analyzes the effect of knowledge on the scale parameter c in the
development phase and the testing phase, respectively.

For the software development phase, assuming that software development and testing
activities are ongoing, the effect of knowledge is shown in Fig. 3.





Fig. 3. The effect of knowledge on development phase 


The knowledge quantity required by software testers has a direct effect on the test. The
more comprehensive and profound the testers' understanding of the software system, the less
additional knowledge they require and the smaller the scale parameter c, which makes
software fault detection faster and effectively improves software reliability. Conversely,
if the knowledge and experience of the testers are weak, more knowledge is required and c is
larger, which makes fault detection slower.





Fig. 4. The effect of knowledge on the testing phase 
Similarly, the effect of knowledge on the software testing phase is shown in Fig. 4.
The less knowledge the tester still requires, the smaller the scale parameter c, so
software faults are found in a timely and effective manner. Conversely, if the knowledge
and experience of the tester are weak, more knowledge is required and the scale parameter
c is larger, which makes fault detection slower.


4 Experiments 


In this section, engineering testing of a certain Linux operating system is carried out
in order to verify the proposed relationship between knowledge and software reliability.

A total of 10 test rounds are designed. The test strategies are as follows:


(1) Each round is independent. 

(2) For each round, the problems and fault modes found are summarized as knowledge.

(3) For each round, test cases are added based on the increased problems and fault 
modes of the previous round. 

(4) For each round, test case distribution is improved based on the fault mode distri- 
bution of all previous rounds. 


The test results of the 10 rounds are shown in Table 1, and the analysis is shown in
Fig. 5. First, testers find more faults as testing proceeds and they learn more about the
software product, and their knowledge is enriched by analyzing these defects and faults.
Second, the numbers of test problems and fault modes increase each round, but the rate of
increase slows and the number of fault modes finally levels off at 35, which means that the
knowledge of the testers has improved.


Table 1. The test results of 10 rounds 


Round Result 
Fig. 5. Analysis of the test results


According to Formula (7), K and c of the Weibull cumulative defect distribution are
estimated from the fault mode numbers (20, 23, 26, 28, 29) of the first five test rounds.

$X = K \cdot f(t) = \frac{K}{c} e^{-t/c}$ (16)

The fitting result obtained with Matlab is shown in Fig. 6, where K = 27.943 and c = 0.911.





Fig. 6. The fitting distribution curve of the first 5 rounds 


Similarly, according to Formula (7), K and c of the Weibull cumulative defect distribution
are estimated from the fault mode numbers (32, 34, 35, 35, 35) of the last five test
rounds.

The fitting result obtained with Matlab is shown in Fig. 7, where K = 34.841 and c = 0.404.

Fig. 7. The fitting distribution curve of the last 5 rounds
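The paper does not list the Matlab fitting script, so the following is only a sketch of one way to obtain such estimates. It assumes the cumulative form $X(t) = K (1 - e^{-t/c})$ with round indices t = 1, ..., 5 within each group and an ordinary least-squares criterion; these modeling choices are assumptions, as is the grid search used in place of Matlab's fitting tool.

```python
import math


def fit_cumulative(counts, c_grid=None):
    """Least-squares fit of X(t) = K * (1 - exp(-t / c)) for t = 1, 2, ..., n."""
    if c_grid is None:
        # Candidate scale parameters from 0.05 to about 3.05 in steps of 0.001.
        c_grid = [0.05 + 0.001 * i for i in range(3000)]
    times = range(1, len(counts) + 1)
    best = None
    for c in c_grid:
        g = [1.0 - math.exp(-t / c) for t in times]
        # For a fixed c the optimal K has a closed form (linear least squares).
        K = sum(y * gi for y, gi in zip(counts, g)) / sum(gi * gi for gi in g)
        sse = sum((y - K * gi) ** 2 for y, gi in zip(counts, g))
        if best is None or sse < best[2]:
            best = (K, c, sse)
    return best


K_early, c_early, _ = fit_cumulative([20, 23, 26, 28, 29])
K_late, c_late, _ = fit_cumulative([32, 34, 35, 35, 35])
print(round(K_early, 3), round(c_early, 3))
print(round(K_late, 3), round(c_late, 3))
```

Under these assumptions the estimates land close to the reported values for both groups of rounds, with the later rounds giving the smaller c.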


Plugging the two values of c above into the probability density function of Formula (7)
gives the two probability density curves shown in Fig. 8.














Fig. 8. The probability density curves of different c 


As the knowledge of the testers increases, more defects are found and the scale parameter
c becomes smaller, which means that knowledge mainly affects the scale parameter c of the
software reliability distribution.
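To make the effect of the two fitted scale parameters concrete, the sketch below evaluates the exponential density and the fraction of faults expected by t = 1 for c = 0.911 and c = 0.404. It uses Formulas (5) and (6) directly; treating one round as one time unit is an assumption.

```python
import math


def exp_pdf(t, c):
    """Formula (6): f(t) = (1 / c) * exp(-t / c)."""
    return (1.0 / c) * math.exp(-t / c)


def exp_cdf(t, c):
    """Formula (5): F(t) = 1 - exp(-t / c)."""
    return 1.0 - math.exp(-t / c)


results = {}
for label, c in (("first five rounds", 0.911), ("last five rounds", 0.404)):
    results[label] = (exp_pdf(0.0, c), exp_cdf(1.0, c))
    print(label, round(results[label][0], 3), round(results[label][1], 3))
```

The smaller c gives a density that starts higher and decays faster, so a larger share of the faults is expected within the first time unit (roughly 0.92 versus 0.67).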


5 Conclusion 


On the basis of a theoretical analysis and derivation on the Weibull distribution of
defect density, this paper concludes, by analyzing the quantitative relation between the
tester knowledge level and the defect distribution, that the knowledge quantity varies
inversely with the scale parameter c of the Weibull distribution. Accordingly, the
proposed relationship can help to analyze the effect of the software test team on the
product reliability curve, so as to predict the trend of problem discovery and software
reliability and to determine the end time of the test process.




C. Yang et al. 


References 


1. Lyu, M.R.: Handbook of Software Reliability Engineering. McGraw Hill and IEEE Computer
   Society Press, New York (1996)
2. Bansal, A., Pundir, S.A.: Review on approaches and models proposed for software reliability
   testing. Int. J. Comput. Commun. Technol. 4(2), 7-9 (2013)
3. Xavier, J., Macêdo, A., Matias, R., et al.: A survey on research in software reliability
   engineering in the last decade. In: Proceedings of the 29th Annual ACM Symposium on
   Applied Computing, pp. 1190-1191. ACM (2014)
4. Duran, J.W., Wiorkowski, J.J.: Capture-recapture sampling for estimating software error
   content. IEEE Trans. Softw. Eng. 1, 147-148 (1981)
5. Nathan, I.: A deterministic model to predict "error-free" status of complex software
   development. In: Workshop on Quantitative Software Models for Software Reliability,
   Complexity and Cost: An Assessment of the State of the Art
6. Musa, J.: Operational profiles in software-reliability engineering. IEEE Softw. 10(2), 14-32
   (1993)
7. Littlewood, B., Verrall, J.L.: Likelihood function of a debugging model for computer software
   reliability. IEEE Trans. Reliab. 30(2), 145-148 (1981)
8. Goel, A.L., Okumoto, K.: Time-dependent error detection rate model for software reliability
   and other performance measures. IEEE Trans. Reliab. 28(3), 206-211 (1979)
9. Xu, R.: The testing method based on software knowledge. J. Wuhan Univ. (Nat. Sci. Edn.)
   46(1), 61-62 (2000)
10. de Santiago Jr., V.A., Vijaykumar, N.L.: Generating model-based test cases from natural
    language requirements for space application software. Softw. Qual. J. 20(1), 77-143 (2012)
11. Kan, S.H.: Metrics and models in software quality engineering (2003)
12. Covert, R.P., Philip, G.C.: An EOQ model for items with Weibull distribution deterioration.
    AIE Trans. 5(4), 323-326 (1973)


Development of Virtual Reality-Based 
Rock Climbing System 


Yiming Su, Dingfang Chen, Congxing Zheng,
Sihan Wang, Liwen Chang, and Jie Mei 


Institute of Intelligent Manufacturing and Control, 
Wuhan University of Technology, Wuhan, China 
953477859@qq.com 


Abstract. In this paper, the development of a virtual reality-based rock climbing
system is presented. Because virtual reality (VR) is immersive, synchronous and
interactive, VR technology has been widely used in many engineering areas. By
combining a multi-path, variable-amplitude rock climbing machine with VR technology,
users can practice rock climbing indoors with an immersive experience. The virtual
rock climbing scene is developed on the Unity3D engine to achieve interaction among
the user, the rock climbing machine and the virtual environment. The system uses VR
technologies such as HTC Vive, Leap Motion and the Unity3D game engine to simulate an
immersive rock climbing experience. The rock climbing machine is designed to change
its path or amplitude automatically, which overcomes the fixed-route limitation of
existing indoor climbing machines. Combined with VR technology, it greatly improves
the indoor climbing experience.


Keywords: Virtual reality · Unity3D · Leap Motion · Rock climbing


1 Introduction 


Virtual reality (VR) typically refers to computer technologies that use VR headsets to
generate realistic images, sounds and other sensations, through which users can interact
with virtual objects or stimuli modeled on the real world. VR environments have been used
extensively in a variety of fields, such as cinema and entertainment, healthcare and
clinical therapies, engineering, and education and training, offering users immersive,
interactive and cost-efficient experiences [1]. The study of VR technology is
interdisciplinary. With the advancement of interface technologies, VR is expected to become
widely popular, changing our lifestyle and making our work easier [2-4].

Many commercial indoor rock climbing machines are available on the market [5-7]. The
Treadwall series of indoor climbing machines, released in 2012, is among the most widely
used and represents the first generation of track-format climbing machines. However, its
climbing path is single and repetitive, so it lacks a sense of immersion. Currently, a
growing body of research and development work uses VR techniques to construct virtual
environments combined with mechatronics for different physical education applications.
Widerun is the first fully interactive bike trainer specifically designed to deliver
engaging fitness sessions through VR headsets and external screens, delivering a
responsive, immersive biking experience with unlimited virtual 3D worlds featuring games
and bike tracks.

In this research, a VR environment was developed to connect to the rock climbing machine
and provide a nearly real experience [8, 9]. HTC Vive was used as the VR headset, Leap
Motion was used to recognize the user's hands, and a Vive Tracker was used to track the
position and orientation of the rock climbing machine. An immersive rock climbing system
requires a development engine that works well with HTC Vive, Leap Motion, Vive Tracker and
their accessory hardware. The Unity3D engine was chosen because it meets all of these needs
[10, 11]. The system combines rock climbing fitness with social networking to provide users
with an immersive rock climbing experience and a scientific health feedback report.


2 Development 


The developed VR system consists of the following components: the user control
interface, the 3D geometric model of the entire environment, and the correspondence
between the virtual and the real.

The overall VR system architecture is illustrated in Fig. 1. A number of software
tools and hardware interfaces are used to develop the VR environment.

Fig. 1. Overall system architecture


2.1 The Control Interface and the Program Design 


The Control Interface consists of four main interfaces: the Log in Interface, the Mode
Selection Interface, the Setting Interface and the Mountains Selection Interface. Each main
interface is built in a separate scene using UGUI [12] (Fig. 2).

The control interface is the window for exchanging information between the user and the
system, and the user can interact with it as needed. Each main interface contains one or
more sub-interfaces. Together they form the application interface level that implements
the operation of system functions.

Fig. 2. Four main interfaces of the Control Interface


To implement the login function, we create a database with two tables. The user base
information table is users (id, username, avatar), and the user authorization information
table is user_auths (id, user_id, identity_type, identifier, credential). When the user
sends a login request with an email address, username or phone number together with a
password, we first determine the identity type and then check the credential. For example,
when someone logs in with the username "xiaoming", part of the code is shown below:

SELECT * FROM user_auths WHERE identity_type = 'username' AND identifier = 'xiaoming'

The user_id is returned when the password matches the credential.
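The two-table login scheme can be sketched end to end with an in-memory SQLite database. This is an illustration only: the paper does not state which database engine is used or how the credential is stored, so the salted PBKDF2 hash, the fixed salt and the helper names below are all assumptions.

```python
import hashlib
import hmac
import sqlite3

SALT = b"demo-salt"  # hypothetical fixed salt, for illustration only


def hash_password(password, salt=SALT):
    """Assumed credential format: hex-encoded PBKDF2-HMAC-SHA256 digest."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000).hex()


def create_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, avatar TEXT);
        CREATE TABLE user_auths (
            id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(id),
            identity_type TEXT,
            identifier TEXT,
            credential TEXT
        );
    """)
    return db


def login(db, identity_type, identifier, password):
    """Return user_id if the credential matches, otherwise None."""
    row = db.execute(
        "SELECT user_id, credential FROM user_auths "
        "WHERE identity_type = ? AND identifier = ?",
        (identity_type, identifier),
    ).fetchone()
    if row is None:
        return None
    user_id, credential = row
    # Constant-time comparison of the stored and the recomputed hash.
    return user_id if hmac.compare_digest(credential, hash_password(password)) else None


db = create_db()
db.execute("INSERT INTO users (id, username, avatar) VALUES (1, 'xiaoming', '')")
db.execute(
    "INSERT INTO user_auths (user_id, identity_type, identifier, credential) "
    "VALUES (?, ?, ?, ?)",
    (1, "username", "xiaoming", hash_password("secret")),
)
print(login(db, "username", "xiaoming", "secret"))
```

Passing the identifier as a query parameter rather than concatenating it into the SQL string avoids injection through the login form.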

When the user is in the scene "SelectMount" and selects a mountain, the program first
loads the "LoadingScene" and then starts the coroutine "BeginLoading" in its Start function.
Part of the code is shown below:

IEnumerator BeginLoading()
{
    // Load the selected mountain scene asynchronously
    asyn = SceneManager.LoadSceneAsync(MountName);
    yield return asyn;
}

The progress of "asyn" is displayed in "LoadingScene", as shown in Fig. 3.

Fig. 3. Loading progress of a mountain scene


The Setting Interface, which consists of video settings and audio settings, enables the
user to change the visual effect of the mountain scene and the sound of the system (Fig. 4).








(a) video settings (b) audio settings

Fig. 4. Two sub-interfaces of the Setting Interface


The video settings allow the user to set the resolution, the anti-aliasing mode, the
filtering mode and the shader, which tell the system how to render. The audio settings
adjust the volume and quality of all sounds in the climbing system, giving the user the
most comfortable sound experience.


2.2 Geometric Model 


2.2.1 Construction of Large-Scale Outdoor Rock Climbing Scene Modeling 
The virtual scene model is the 3D data foundation of the virtual reality system [13]. The
fidelity of the model and the precision of the data are key factors for the successful
construction of the virtual reality system. In this paper, the construction process of a
large-scale outdoor rock climbing scene and the key technologies of large-scale virtual
scenes for VR applications are studied systematically, and a large-scale immersive rock
climbing scene is constructed. The building process is shown in Fig. 5.


Step 1: Select real outdoor climbing peaks for parametric analysis. Study the
corresponding terrain appearance to provide data and graphics support for building
the virtual mountain model.

Step 2: Use the Terrain tool in Unity3D to draw a rough model of the mountain, and then
build the detailed model so that its shape is similar to the appearance of the real
mountain.

Step 3: Apply texture maps to the model and improve the mapping quality. Add the sky,
sunlight, trees and other natural elements to increase realism and improve the
immersive experience.

Step 4: Determine whether the virtual rock climbing scene is similar to the real peak
scene. If there is a big gap, continue to optimize the model to improve the sense of
immersion. If it meets the requirements, the scene is considered complete.





Fig. 6. UV mapping principle (labels: UV points, geometric vertices)


2.2.2 Development of Immersive Scene 

A three-dimensional model has two important coordinate systems: the vertex positions
(X, Y, Z) and the UV coordinates [14]. As shown in Fig. 6, the UV texture map defines
positions within the texture image and associates them with positions on the surface of
the 3D model. UV texture mapping differs from traditional planar projection mapping: each
point of the image is matched accurately to the surface of the model object, and the gap
between two adjacent points is filled by smooth interpolation of the image.
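The smooth interpolation between adjacent texture points can be illustrated with bilinear sampling. This is a sketch under stated assumptions: the text does not name the interpolation scheme, so bilinear filtering is chosen here as a common example, and the tiny grayscale texture is hypothetical.

```python
def sample_bilinear(texture, u, v):
    """Sample a grayscale texture (list of rows) at UV coordinates in [0, 1]."""
    height, width = len(texture), len(texture[0])
    # Map UV to continuous pixel coordinates, clamping to the texture edges.
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two neighbouring rows, then vertically.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [
    [0.0, 1.0],
    [1.0, 0.0],
]
print(sample_bilinear(texture, 0.25, 0.0))
print(sample_bilinear(texture, 0.5, 0.5))
```

A surface point whose UV coordinate falls between texel centres thus receives a weighted blend of its four neighbours instead of a hard jump.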

In the 3ds Max modeling software, polygon modeling is used to build the rock model, as
shown in Fig. 7(a). The rock texture map is obtained by reference to the appearance of
real rock. Through UV mapping, the texture is applied to the rock model to derive the
final model, shown in Fig. 7(b).





(a) Original rock model (b) Rock model with UV map 


Fig. 7. Rock model built by polygon modeling technology and UV map method 





(a) Original large-scale terrain (b) Optimized large-scale terrain 





(c) Close-range mountain


Fig. 8. Mountain modeling 


The brush tool of the terrain editor is used to build the mountain model. With the actual
surface of Mount Roraima as the reference, the mountains, valleys and plains of the real
peaks are carved out, as shown in Fig. 8(a). The color map and the normal map of the
environment are processed in Photoshop to obtain the corresponding textures. The terrain
editor is then used to assign different materials to the terrain. The resulting model is
shown in Fig. 8(b), and the close-range mountain is shown in Fig. 8(c).


2.2.3 LOD Technology 

While the VR system is running, the hardware has to render in real time, which places
high demands on the optimization of complex scenes. Level-of-detail (LOD) technology
reduces the rendering load while preserving the visual effect. As the distance between the
virtual viewpoint and the target model changes, the system displays models of different
levels of detail. LOD follows the principle that a rough model is drawn at long range and
a detailed one at close range. The LOD principle is shown in Fig. 9.
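The selection rule can be sketched as a lookup of the viewing distance in a sorted list of switch distances. The thresholds below are hypothetical; the paper does not give the values used. Unity's LOD Group component, mentioned later in this section, switches on the object's relative screen height rather than on raw distance, so this is a simplified model of the principle, not of that component.

```python
from bisect import bisect_right

# Hypothetical switch distances in metres: below 10 -> LOD0, 10-30 -> LOD1, and so on.
LOD_DISTANCES = [10.0, 30.0, 60.0]


def select_lod(distance, thresholds=LOD_DISTANCES):
    """Return the LOD index: 0 is the most detailed model, higher is coarser."""
    return bisect_right(thresholds, distance)


for d in (5.0, 10.0, 45.0, 100.0):
    print(d, "-> LOD", select_lod(d))
```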






Fig. 9. LOD schematic (label: moving direction)





(c) LOD2 (d) LOD3 


Fig. 10. LOD rendering models 


By building models of a complex object at several levels of detail and drawing the level
that matches the viewing distance, computing resources are optimized while the visual
effect is preserved. The system is based on the Unity3D engine. By applying LOD Group
components and adding models of different levels, resource use is adjusted accordingly.
Taking the tree model as an example, the LOD rendering levels are shown in Fig. 10.


2.3 Correspondence of the Virtual and Real 


2.3.1 Correspondence of the Climbing Surface 
When the machine's amplitude changes, the position and rotation of the tracker change at
the same time. Because the virtual mountain surface and the real rock climbing surface
have different rotation centers or axes, the position and rotation of the tracker must be
transformed before being assigned to the virtual mountain surface, so that the virtual
mountain surface stays superimposed on the rock climbing surface after the amplitude
changes.

A script named Accessory.cs is attached to the virtual mountain surface in order to fix
the relative position and rotation between the Vive Tracker and the rock climbing machine.
Part of the code is shown below [15]:


// Get the current Tracker pose
Vector3 tracker_position =
    SteamVR_Controller.Input(deviceIndex).transform.pos +
    GameObject.Find("LMHeadMountedRig").transform.position;

Quaternion tracker_rotation =
    SteamVR_Controller.Input(deviceIndex).transform.rot *
    GameObject.Find("LMHeadMountedRig").transform.rotation;

// Transform the current Tracker pose to the Accessory pose
this.transform.rotation = tracker_rotation * delta_rotation;
this.transform.position = tracker_position +
    (tracker_rotation * delta_rotation) * delta_displacement;
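The pose arithmetic in this script can be reproduced outside Unity to see what it does. The sketch below implements the quaternion product and the rotation of a vector by a quaternion in plain Python and then applies the same two assignments. Quaternions are written as (w, x, y, z) here, which differs from Unity's (x, y, z, w) field order, and the numeric values are hypothetical.

```python
import math


def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def q_rotate(q, v):
    """Rotate vector v by unit quaternion q (computes q * v * conjugate(q))."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = q_mul(q_mul(q, (0.0, v[0], v[1], v[2])), conj)
    return (rx, ry, rz)


def accessory_pose(tracker_pos, tracker_rot, delta_rot, delta_disp):
    """Mirror of the two assignments in the script: rotation first, then position."""
    rot = q_mul(tracker_rot, delta_rot)
    offset = q_rotate(rot, delta_disp)
    pos = tuple(p + o for p, o in zip(tracker_pos, offset))
    return pos, rot


# Hypothetical pose: tracker at (1, 2, 3), rotated 90 degrees about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
identity = (1.0, 0.0, 0.0, 0.0)
pos, rot = accessory_pose((1.0, 2.0, 3.0), quarter_turn_z, identity, (1.0, 0.0, 0.0))
print(tuple(round(p, 6) for p in pos))
```

The fixed offset delta_displacement is expressed in the rotated frame, which is why it is rotated before being added to the tracker position.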


2.3.2 Correspondence of the Climbing Holds 

To match the holds in the virtual system, the plane position data must be converted to
the position of each hold relative to the starting point. We first obtain the relative
position of the tracker and the zeroth hold on the zeroth plate, deduce from it where the
corresponding virtual hold should be, and use this as the starting point for the virtual
holds. In the virtual scene, we create an empty GameObject named HoldsGroup as the plane on
which all holds are placed. We load each hold at its distance from the starting point,
converted from the path file, and make it perpendicular to the HoldsGroup plane. Finally,
we superimpose the HoldsGroup plane on the virtual mountain surface so that each virtual
hold coincides with the real hold.

In addition, we measure the running speed of the rock climbing machine with an encoder
during operation and send it to the PC via a serial port. When the machine is running, the
rock climbing surface rotates and the real holds continuously move down. We move the Room
up in the VR environment at the speed received from the machine so that the virtual holds
stay matched with the real holds.
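A simplified sketch of these two steps follows: placing holds relative to a starting point, and moving the Room up by integrating the speed reported by the encoder. The data layout, units, sampling interval and function names are assumptions, since the path file format and the serial message format are not given.

```python
def hold_positions(start, offsets):
    """Place each hold at its offset (dx, dy) from the starting hold, on the plane."""
    return [(start[0] + dx, start[1] + dy) for dx, dy in offsets]


def advance_room(room_height, surface_speed, dt):
    """Move the virtual Room up by the distance the real surface moved down."""
    return room_height + surface_speed * dt


# Hypothetical path data: offsets of three holds from the zeroth hold, in metres.
holds = hold_positions((1.0, 2.0), [(0.0, 0.0), (0.3, 0.5), (-0.2, 1.0)])

# Hypothetical encoder readings in m/s, sampled every 0.5 s.
speeds = [0.10, 0.12, 0.12, 0.08]
room = 0.0
for speed in speeds:
    room = advance_room(room, speed, dt=0.5)

print(holds)
print(round(room, 3))
```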


3 Interactive Device 


3.1 Leap Motion 


Leap Motion is used as the hand and finger tracking device [16]. To use Leap Motion,
we install the Leap Motion Orion software and download and import the
UNITY CORE ASSETS package into the Unity3D “IndoorClimbing” project. Then we
drag an LMHeadMountedRig into the scene from the path LeapMotion/Prefabs. The
LMHeadMountedRig prefab uses the camera location provided by Unity to place the 
LeapHandController at the correct position in the virtual world. The coordinates in the 
tracking data are then transformed from Leap space to Unity space relative to the position 
and orientation of the LeapHandController game object. 

We also import the UI Input module in order to control the standard Unity UI
widgets with bare hands. The primary component of the UI Input module is the
LeapEventSystem prefab. The LeapInputModule script component of this prefab implements
a Unity Input Module that uses tracking data to allow the user to manipulate standard
UI controls with their hands.


3.2 Vive Tracker 


The Vive Tracker has several modes for communicating with a PC. The mode we use tracks
moving objects in VR over a wireless interface, with the accessory (here, the rock
climbing machine) passing data to the PC via “FATFS-SDIO-USB” [15]. The dongle
transfers tracking data from the Vive Tracker to the PC, while the accessory exchanges
data with the PC directly for a purpose specific to our design: transferring the
path file to the rock climbing machine’s controller. The communication mode is shown
in Fig. 11.


3.3 The Rock Climbing Machine 


We have designed a rock climbing machine that is compatible with the immersive rock
climbing system and enables interaction with the immersive rock climbing environment.
Four functional modules were designed to make indoor climbing as similar as possible to
outdoor climbing by superimposing the virtual mountain surface on the rock climbing
surface (Fig. 12).





Fig. 12. The virtual mountain surface superimposed on the rock climbing surface 


The overall drive module drives the rock climbing machine through the climber’s body
weight together with hydraulic resistance, achieving dynamic balance during the
climbing process. The path change module changes the climbing path during climbing.
The variable amplitude module changes the slope of the climbing machine through a worm
gear mechanism in order to simulate the different rock slopes found in real outdoor
climbing. The security module reduces the risk to the user during climbing and the
failure rate of the machine.


4 Conclusion 


Virtual reality, as an emerging technology, has great development prospects, and rock
climbing, as a new sport, is increasingly popular. The immersive rock climbing system
not only addresses the shortage of rock climbing sites, but also offers a solution to
the repeated paths and dull experience of indoor climbing machines. The system combines
a rock climbing machine with virtual reality technology to make indoor rock climbing
more intense and more authentic. It turns climbing, a sport with a high risk factor,
into a safe indoor activity, and broadens the range of people who can enjoy rock
climbing. Users can stay at home and enjoy the beauty of the world’s famous climbing
mountains via this system. This project is of great significance to the development of rock
climbing and to the combination of virtual reality technology and sports [17].


Abstract. A 3 degree-of-freedom (DOF) mobile handling robot model is
designed with the SolidWorks package. The link coordinate systems and kinematics
equations are established by the D-H parameter method. The forward and inverse
kinematics are solved, which offers a theoretical foundation. Then, the kinematics
simulation analysis of the robot is carried out using the ADAMS multi-body dynamics
simulation software. Finally, the trajectory planning is carried out in MATLAB.
This work establishes the foundation for the structure design, dynamic analysis and control
system of the robot.


Keywords: D-H parameter method · Handling robot · Kinematics
ADAMS simulation · Coordinate system


1 Introduction 


Handling robots made by KUKA, ABB and others have been used in production
processes such as assembly, welding and handling [1]. As robot technology gradually
matures, handling robots have become an important support in the field of
industrial production [2]. In microelectronics manufacturing, welding, packaging and
many other areas, handling robots have been widely used [3]. At present, handling
robots are widely used in all important sectors of the national economy [4]. However,
there are few robots that combine motion and handling. In fact, robots combining
motion and handling are necessary and can play an important role in many fields.

In this paper, a mobile handling robot model is established, and the
displacement of the joints is analyzed, which provides a basis for debugging the
handling robot, saves on-site debugging time and protects the handling robot to
some extent [5].


2 Kinematics Models of 3-DOF Mobile Handling Robot 


The robot designed in this paper is a 3-DOF mobile handling robot. The three joints of
the manipulator are rotating joints whose rotation axes are parallel to each other. As
shown in Fig. 1, the front mobile platform (1) is equipped with a servo for the steering
of the mobile platform, and the servo implements the forward and backward commands of the
platform. The joints (7), (8), (9) are each driven by a servo to achieve rotation.
All servos are controlled by the controller (3). The end part of the arm (6) can be fitted
with different clamping mechanisms depending on the work object. In actual operation,
the object can be held and moved by the mobile platform from target point 1
to target point 2.





Fig. 1. Three-dimensional simplified model of the mobile handling robot. 1-Mobile platform,
2-Additional weight, 3-Controller, 4-Big arm, 5-Middle arm, 6-End arm, 7-Joint 3, 8-Joint 2, 9-Joint 1


When the robot actually works, movement and handling are not carried out at the
same time. Therefore, the model can be simplified when analyzing the robot’s handling
characteristics: the mobile part of the robot can be neglected temporarily, and only the
manipulator is analyzed. According to the D-H parameter method, the coordinate systems
of the links in the manipulator are built. The coordinate system of each joint is established
at the end of the link. Only two axes of each coordinate system are drawn in Fig. 2.





Fig. 2. Link coordinate system of the mobile handling robot


According to the link coordinate system established in Fig. 2, the corresponding
D-H parameter table can be obtained, as shown in Table 1.

Table 1. D-H parameter table of the mobile handling robot





[Eq. (1): transformation matrix of the robot; not recoverable from the source] (1)


In the forward kinematics analysis, the geometric parameters and joint variables of the
links are assumed to be known, and the position and posture of the robot’s end effector
relative to the reference coordinate system are solved [6]. In inverse kinematics, the
position and posture of the end effector are given and the joint parameters are derived.
The inverse kinematics solution is the basis of robot motion control [7], because in
practical work the variables of all joints are determined from the position of the
target point.
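As an illustration of the forward kinematics step, the sketch below chains standard D-H link transforms for three revolute joints with parallel axes. It is a minimal Python sketch under stated assumptions: the link lengths are hypothetical (the values of Table 1 are not available here), the twist and offset of every link are taken as zero, and the function names are not from the paper.

```python
import math


def dh_matrix(theta, d, a, alpha):
    """Standard D-H homogeneous transform:
    Rot_z(theta) * Trans_z(d) * Trans_x(a) * Rot_x(alpha)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(thetas, link_lengths):
    """Pose of the end link for a chain of revolute joints with parallel axes
    (assumed alpha = 0 and d = 0 for every link)."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, a in zip(thetas, link_lengths):
        T = matmul(T, dh_matrix(theta, 0.0, a, 0.0))
    return T
```

The last column of the returned matrix holds the end position. With assumed link lengths (1, 2, 3) and joint angles (90°, −90°, 0°), the end point is at x = 5, y = 1, z = 0; z stays at zero because all joint axes are parallel.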

For a given position and posture, the transformation matrix of the robot is given by
formula (1). The values of the joint variables θ1, θ2, θ3 can be solved by multiplying
Eq. (1) by the corresponding inverse transformation matrices, as shown in Table 2:


Table 2. The algebraic solution of each joint 


6(°) The algebraic solution 


6, arctan (a,/a,) 
0, arctan ((a,cO, = a,s0,)/(a,c0; + a,s0,)) 
0; arccos (c0, (n,c0, — nsô) = sð, (n,c0; F n,s0, ) ) 


3 Kinematics Analyses 


Virtual prototyping technology is a digital design method based on virtual prototypes,
which can shorten product development cycles and improve product quality [8]. A
three-dimensional simplified model of the robot is imported into ADAMS as a virtual
prototype for simulation [9]. The ADAMS simulation is carried out to obtain the
displacement curves of the marker points relative to the ground, as shown in
Figs. 3, 4, 5 and 6.

Fig. 3. The marker point displacement curve of joint 1

Fig. 4. The marker point displacement curve of joint 2

Fig. 5. The marker point displacement curve of joint 3


As can be seen from Fig. 3, the displacement of marker point 1 does not change
because joint shaft 1 is connected to the seat (i.e., the base). From Figs. 4, 5 and 6,
the displacement of the marker points in the Z direction does not change, so there is
no movement in the Z direction. In the X and Y directions, the displacement curves
approximate a sine function. The results of the ADAMS simulation are in accordance with
the actual situation, which verifies the rationality of the designed structure. The
kinematics analysis provides the theoretical and simulation basis for the subsequent
product design.

Fig. 6. The marker point displacement curve of terminal link 3


4 Trajectory Planning


Trajectory planning maps each joint variable to a smooth function of time [10], and
includes Cartesian space planning and joint space planning. Cartesian space
planning uses a sequence of Cartesian coordinate points of the end effector position
to constitute a trajectory, which may pass through a singularity. This situation does not
arise in joint space planning. In this paper, the trajectory planning of the robot is carried
out using the joint space planning method (Fig. 7).
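To make the idea of mapping a joint variable to a smooth time function concrete, the sketch below interpolates one joint with a cubic polynomial that starts and ends at rest. The choice of a cubic, the function name and the sample values are assumptions for illustration; the paper does not state which interpolating function was used in MATLAB.

```python
def cubic_joint_trajectory(theta0, thetaf, duration):
    """Return position(t) and velocity(t) for a cubic joint-space trajectory
    from theta0 to thetaf with zero velocity at both ends."""
    delta = thetaf - theta0
    a2 = 3.0 * delta / duration ** 2   # coefficient of t^2
    a3 = -2.0 * delta / duration ** 3  # coefficient of t^3

    def position(t):
        return theta0 + a2 * t ** 2 + a3 * t ** 3

    def velocity(t):
        return 2.0 * a2 * t + 3.0 * a3 * t ** 2

    return position, velocity
```

Each joint is planned independently in this way, so the end effector path never has to be inverted through a singular configuration.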


Fig. 7. Space model of robot (plot title: mobile handling robot 3-dimension model figure; axes in mm)


As can be seen from Fig. 8, the kinematic curves of the terminal manipulator in the x and
y directions resemble trigonometric functions, and the curve in the z direction shows
essentially no change, which agrees with the previous ADAMS simulation results.
Further processing of the curves yields the corresponding velocity and acceleration
graphs. The trajectory is in accordance with the operational requirements of the mobile
manipulator robot, indicating that the path planning method is reasonable.

Fig. 8. Terminal manipulator


5 Conclusion 


In this paper, the displacement curves of the three joints and the end point of the robot
are obtained, and the rationality of the structural design is verified, which provides the
theoretical and simulation basis for the next step of the robot design. The two-dimensional
and three-dimensional displacement curves of the end of the manipulator further
validate the rationality of the mechanism and the correctness of the ADAMS simulation.
This work establishes the foundation for the structure design, dynamic analysis and control
system of the robot.


Abstract. In this paper, waste PET bottles are analyzed. Based on experimental
analysis and the related literature, we determine the feasibility of using waste
PET plastic bottles as a 3D printing material. We design a 3D printing nozzle that uses a
positive and negative screw to extrude material, and introduce its principle.
The feasibility of the design is demonstrated by three-dimensional
modeling, physical experiment analysis and ANSYS simulation analysis.


Keywords: FDM technology · PET material · Positive and negative screw
Modification of material


1 Introduction 


The “No. 1” plastic bottle, which is made of PET, accounts for more than half
of the plastic bottle market. Its output is huge, but its recovery rate is low [1]. PET
can be an excellent printing material because of its strong adhesion
between layers, good mobility, easy carbonation and other advantages. However,
because the crystallization rate of PET is slow, it cannot meet
the rapid prototyping requirements of 3D printing technology.

In recent years, FDM has been the most widely used 3D printing technology, and most
open source desktop 3D printers use it [2]. FDM printing
materials are mostly ABS and PLA [3]; these materials are expensive, and ABS releases
toxic gases during printing. The use of ABS and PLA for FDM printing has other
shortcomings as well: the filament breaks easily, the extruded product is of poor quality,
product strength is insufficient and surface accuracy is poor. To improve
printing performance, Kannan [4] added iron powder to ABS, using a surface
active agent to make an iron powder/ABS composite. You Shu studied the effect
of 3D printing conditions on the mechanical properties of PLA plastics [5].

In this paper, we first study the modification of the material. We then design a
nucleating agent addition device, a positive and negative threaded screw extrusion device,
and a heating and cooling system. The feasibility of the design is demonstrated by
three-dimensional modeling, physical experiment analysis and ANSYS simulation analysis.


2 The Study on Modification of Materials


DSC experiments were performed on two types of mainstream PET plastic bottles on the
market; the experimental results are shown in Figs. 1 and 2. According to the DSC
data, the Tm of waste PET material is 240 °C–255 °C, which differs little from that of
pure PET, so the material can be used for secondary processing.

Adding a nucleating agent to PET during the melting process helps to increase the
crystallization rate and improve the mechanical properties of the finished product
[6]. The results showed that when the inorganic nucleating agent talc powder is used to
modify the PET material at a mass ratio of 5% [7], the
crystallization effect is better and the printing effect is the best.


3 Nozzle Mechanical Structure


3.1 Overview of the Overall Structure 


The new nozzle structure includes a nucleating agent addition device, a positive
and negative threaded screw extrusion device, and a heating and cooling system. Figure 3
shows an exploded view of the model.

Fig. 3. Model explosion diagram

Fig. 4. Schematic diagram of heat dissipation





The nucleating agent adding device comprises a feeding screw rod and a material
cylinder. It adds the nucleating agent accurately and stably.
The extruding device is composed of a mixing cylinder and a positive and negative
threaded screw.

The heating device is arranged outside the mixing cylinder and is heated by a heating
rod, and the temperature is accurately controlled using a temperature sensor. The heat
dissipation system works by air pump blowing, which gives a better heat dissipation
effect, lower energy consumption and less noise. A schematic diagram
of the heat dissipation system is shown in Fig. 4.


3.1.1 Nucleating Agent Adding Device 

Because the amount of nucleating agent is small relative to the amount of PET
material, it needs to be accurately controlled. In this paper, a screw feeding mechanism
is used for feeding. The nucleating agent is stored in the material bed and sent to the feed
bed continuously through an external device, to ensure that the nucleating agent can be
continuously transported to the PET cylinder. The feed screw of the nucleating agent is
driven by a micro reduction motor, and the rotation speed of the motor is controlled
accurately through a PWM wave, so that the nucleating agent in the screw thread gap
can be accurately and stably added to the print nozzle.


3.1.2 Positive and Negative Thread Extrusion Device 
A section view of the device and the screw thread is shown in Fig. 6. Both ends of the
screw carry a positive thread that fits tightly against the inner wall of the sleeve, so
that the two ends feed and extrude the material. The middle part carries a reverse
thread, and a certain clearance between this thread and the inner wall of the sleeve
acts as a buffer. PET sheet material enters through the upper material inlet and is
melted. The screw thread transports the molten PET downwards, while the reverse thread
produces a certain back flow, so that the PET is fully mixed with the nucleating agent.
The molten PET is then extruded from the nozzle at the lower end of the screw.


3.1.3 Heating and Cooling System 
The heating device is arranged around the nozzle and heats the mixing device, and
a heating rod and a temperature sensor are fitted inside it. The heating
rod raises the temperature of the heating block. When the temperature is higher than the
preset temperature, the heating rod stops working; when the temperature is lower than
the preset temperature, the heating rod starts working and the heating block temperature
rises. Through this dynamic adjustment process, the internal temperature of the screw
extruder is maintained at about 250 °C during screw extrusion.
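The on/off rule described above can be sketched as a simple thermostat loop. This is a hedged Python sketch: the function names, the fixed heating and cooling rates and the step count are illustrative assumptions, not measured properties of the nozzle, and the case of the temperature exactly equal to the preset (which the text leaves open) is treated here as heater off.

```python
def heater_command(temperature, preset):
    """Return True (heating rod on) below the preset, False otherwise."""
    return temperature < preset


def simulate(preset=250.0, start=20.0, steps=2000, heat_rate=1.5, cool_rate=0.5):
    """Toy plant model: the block warms by heat_rate per step while the rod
    is on and cools by cool_rate per step while it is off (assumed values)."""
    temperature = start
    history = []
    for _ in range(steps):
        if heater_command(temperature, preset):
            temperature += heat_rate
        else:
            temperature -= cool_rate
        history.append(temperature)
    return history
```

In this toy model the temperature climbs to the preset and then oscillates in a narrow band around it, which is the dynamic adjustment the text describes.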

The heat dissipation device is installed on the nozzle frame to dissipate heat at the
nozzle, so as to ensure that the PET material crystallizes and cools in time when it is
extruded from the nozzle.


3.2 Experiments and Simulation Analysis 


3.2.1 Positive and Negative Screw Experiment Analysis 

First, the simulation results show that the positive and negative screw
can mix the PET material with the nucleating agent. The experimental process is
shown in Fig. 5. The screw can extrude the molten PET smoothly. In addition, it is
verified that the reverse thread achieves the purpose of mixing.





Fig. 5. Test simulation process

Figure 6 shows the physical positive and negative screw. Figure 7 shows the nozzle’s
working process.











Fig. 6. The positive and negative screw

Fig. 7. The nozzle’s working process


From the experimental results, we conclude that:


(1) The discharge rate of the molten PET material depends on the screw speed.
The faster the screw turns, the faster the molten PET material is extruded.

(2) The molten PET material flows simultaneously in the cartridge and can be well
blended with the nucleating agent.

(3) The flow velocity is relatively stable and PET is extruded smoothly at the nozzle.


3.2.2 Simulation Analysis of Heating and Cooling System 
We conduct a thermal analysis of the nozzle’s heating and cooling modules in ANSYS
Workbench [8]. The nozzle, heating block and mixing cylinder are made of brass, with a
thermal conductivity of 45 W/(m·K). The screw is made of stainless steel, with a thermal
conductivity of 14.6 W/(m·K). The melting temperature of PET is between 245 °C and 255 °C.
The optimum heating temperature of the heating rod is obtained by thermal analysis. The
convective heat transfer coefficient is set to 50 W/(m²·K) for the nozzle surface,
35 W/(m²·K) for the heating block and 20 W/(m²·K) for the mixing cylinder. Different
heating temperatures are set for analysis. When the temperature of the heating rod is set
to 270 °C, the internal temperature of the mixing cylinder reaches 250 °C. After cooling
by the air pump at the nozzle, the temperature drops to 230 °C. This is consistent with
the temperature requirement for PET melting during actual printing.

Figure 8 shows the temperature distribution nephogram, and Fig. 9 shows the pressure
nephogram.

Fig. 8. Temperature distribution nephogram

Fig. 9. Pressure nephogram


4 Conclusion 


The nozzle we designed completes the modification and extrusion molding of waste
PET material, so that the whole process from plastic bottle to printed model can be
realized. PET does not produce toxic substances during printing, which makes it more
environmentally friendly than ABS. However, its melting range is narrow, and a
suitable temperature is an important factor in the printing effect, so
controlling the printing temperature accurately is a problem we still need to solve.
Due to the complex internal mechanical structure of the nozzle, the vibration amplitude
of the nozzle has a great influence on printing accuracy, and making printing
smoother is an important direction for mechanical debugging. This design solves
the problem that current printers cannot use PET as a 3D printing raw material. Waste PET
material is cheap: the cost of printing is reduced by about 90% compared with ABS and
PLA, which provides a new opportunity for the development and wide application
of 3D printing technology [9].


Abstract. Virtualization and cloud computing technologies enable modern data
centers to consolidate various services and applications on the prevalent multicore
processors to improve resource utilization. However, service consolidation
carries the risk of degraded quality-of-service (QoS) due to uncontrolled contention
for the shared last-level cache (LLC). Cache partitioning techniques are promising
for improving resource utilization while guaranteeing QoS. The capacity
of the LLC in data centers is ever growing, and some practical cache
partitioning techniques have been implemented in production systems. Although
partitioning schemes have been explored extensively, how to make effective use of
partitioning in data centers is still an important problem that is not well understood.
Given the varying cache configurations, the complex workload mixes with diverse
memory characteristics, and the different overheads of partitioning algorithms,
cache partitioning does not always improve performance. In this
paper, we explore when partitioning works and when it does not. We investigate the
impact of cache configurations, the memory characteristics of programs, and partitioning
variation on the performance gain under partitioning. We also identify several
interesting findings and implications that can help future cache system design and
optimization for cloud data centers.


Keywords: Cache partitioning · Memory architecture · Empirical study


1 Introduction 


Poor resource utilization in data centers increases the total cost of ownership (TCO)
of IT services. For example, average utilization is only around 6% to 12% in
Google’s data center [10]. Recent efforts in virtualization and cloud computing
technologies promise to improve resource utilization by consolidating many services
on the same server. However, co-locating more applications carries the risk
of degraded quality-of-service (QoS) due to uncontrolled contention for shared
resources, primarily the last-level cache (LLC) [18, 19]. Researchers leverage cache
partitioning techniques [17] to control the arbitrary access to the shared LLC. As an
important performance isolation mechanism, cache partitioning has been proposed to
restrict the amount of shared cache lines that an application can access when
it co-runs with other workloads. Depending on the optimization target, a cache
partitioning technique generally consists of a line allocation policy, which determines
the number of lines a program receives, and a partitioning enforcement mechanism, which
ensures that the allocation is actually applied. Prior works explored many cache
partitioning schemes implemented in hardware [8, 14, 15, 19], software [9, 20], or
co-designed hardware and software [4, 6].

As the capacity of the last-level cache (LLC) in modern x86 processors is ever growing,
some practical cache partitioning technologies have been implemented in production
systems and markets [4, 6]. It is important to understand how to make effective use
of cache partitioning techniques in data centers. Although cache partitioning techniques
have been extensively explored, given the varying cache configurations, the
complex workload mixes with diverse memory characteristics, and the different overheads
of partitioning algorithms, how to use cache partitioning effectively is still not well
understood. We carry out an experimental case study in which we consolidate various
quad-core workload mixes to co-run on the same multicore server that supports cache
partitioning, and find that cache partitioning is not always the expected
winner. For some workloads the expected performance gains of partitioning vanish; even
worse, for some it results in unexpected performance degradation.

We are interested in this unexpected behavior of cache partitioning, which has
not been studied in prior work. In this paper, we explore when partitioning
works and when it does not, and how to make effective use of it. In an empirical
study of a commonly used way-partitioning technique, we perform a thorough
evaluation with the SPEC CPU2006 suite. We group the 29 programs into 11
subsets based on memory intensity analysis and way sensitivity analysis, and construct
77 workloads with various memory characteristics in terms of memory access intensity
and associative way sensitivity. We investigate how the performance gain under
partitioning is affected by many possible factors, spanning cache configurations, the
memory characteristics of programs, and partitioning variations. We also
identify several interesting findings and implications that can help future cache
system design and optimization for cloud data centers.

We highlight our key contributions here:


• We identify an important problem, namely how to make effective use of cache partitioning
technology, which has not been well understood by prior studies. To seek answers
to this problem, we perform a detailed empirical study of a commonly used way-partitioning
technology for a variety of workload mixes with different memory access
intensity and associative way sensitivity, and quantitatively study the impact of cache
configurations, memory characteristics of programs, and partitioning variation on the
performance gain under partitioning.

• We identify several interesting findings and implications that can help future
cache system design and optimization for cloud data centers. (i) There is a close
correlation between cache configurations and the performance gains from cache
partitioning: cache partitioning does not work in caches of small capacity, and
increasing the number of sets can improve the performance gains of cache partitioning.
Moreover, it is not beneficial to design highly associative caches in data centers because
they do not make cache partitioning work better. (ii) Consolidating services in data
centers should take the memory characteristics of programs into account. Performance
degrades when cache partitioning is applied to programs that have little similarity
in memory access intensity or that have high memory intensity. Co-scheduling programs of
heterogeneous cache way sensitivity can make effective use of cache partitioning.
(iii) The overhead of dynamic partitioning adjustment impacts the performance gains
of cache partitioning. Designing a partitioning algorithm that keeps partition sizes steady
by adjusting them less frequently can improve performance.


2 Motivation 


Cache partitioning is an important performance isolation mechanism used to guarantee
QoS. Table 1 summarizes the partitioning schemes commonly used in
past research efforts. Hardware-based implementations require architecture modification
and are performance efficient. Software-based schemes are based on page coloring,
which leverages dynamic page allocation in the OS to force pages into
contiguous cache regions. Co-HW/SW-based solutions combine the flexibility and efficiency
of both software and hardware through low-level architecture extensions and a group of
programming control routines.


Table 1. Cache partitioning technologies 


Implementation Partitioning schemes 

Hardware Way-partitioning [8, 14, 15, 19] 
Software Page coloring [9, 20] 
Co-HW/SW Intel CAT [6] 


The capacity of the last-level cache (LLC) in modern x86 processors is ever growing.
For example, Intel has released the Xeon E7-8893 v4 processor equipped with
an LLC of 60 MB, and the IBM Power 8 has an even larger LLC of 96 MB. It is therefore
important to understand how to make effective use of cache partitioning techniques in
data centers. Although it was proposed about a decade ago, way-partitioning is still an
active baseline in cache partitioning research [1, 12, 19], even for complex workloads in
the context of cloud computing [3] and warehouse-scale computers [7]. In practical
production systems and markets, Intel has released Cache Allocation Technology (CAT)
[4, 6] in Haswell processor SKUs, which is also based on way-partitioning.
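To illustrate what a way-based allocation looks like, the sketch below turns a list of per-core way counts into non-overlapping bitmasks, one bit per cache way. It is a simplified Python sketch: the function name, the 16-way default and the assumption that each partition is a contiguous, exclusive run of ways are illustrative choices, not a description of how any specific processor is programmed.

```python
def way_masks(way_counts, total_ways=16):
    """Build one contiguous bitmask per core from its number of ways.

    way_counts -- ways granted to each core (each at least 1)
    total_ways -- associativity of the shared cache (assumed 16 here)
    """
    if sum(way_counts) > total_ways:
        raise ValueError("allocation exceeds the number of ways")
    masks = []
    shift = 0
    for count in way_counts:
        if count < 1:
            raise ValueError("each core needs at least one way")
        masks.append(((1 << count) - 1) << shift)
        shift += count
    return masks
```

For instance, splitting 16 ways as 8, 2, 2 and 4 yields the masks 0x00FF, 0x0300, 0x0C00 and 0xF000.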

Although cache partitioning techniques have been extensively explored, how to use
cache partitioning effectively is still not well understood. We carry out an
experiment and find that cache partitioning is not always the expected winner.
We consolidate various quad-core workload mixes to co-run on the same multicore
server that supports CAT, and measure their performance under the baseline LRU and
a CAT-based partitioning policy, respectively. Figure 1 reports the results. We find
that cache partitioning has a diverse impact on performance for different workload
mixes. For some workloads, cache partitioning indeed outperforms LRU as expected. For
other workloads, the expected benefits of partitioning vanish. Even worse, partitioning
results in unexpected performance degradation for some workloads. We are interested in
finding the explanation for this unexpected behavior, which has not been reported in
prior work. In this paper, we are motivated to explore when partitioning works and when
it does not, and how to make effective use of it.

Fig. 1. Performance comparison for various workload mixes under the baseline LRU and the
Intel CAT-based partitioning replacement policy


3 Experimental Methodology 


3.1 Simulator 


We use ZSim [16], an event-driven simulator, to model our baseline multicore system. ZSim
is a fast x86 micro-architecture simulator based on Intel Pin [11]. We closely model an
Intel Sandy Bridge processor, whose configuration parameters are detailed in Table 2.
Each core has private, split 32 KB 4-way associative L1 instruction and data caches, and
a unified 256 KB 8-way associative L2 cache. A unified 4 MB 16-way associative L3 cache
(LLC) is shared by all cores. The default LLC replacement policy in our baseline system is
LRU. We choose Utility-based Cache Partitioning (UCP) [14] as the way-partitioning
candidate, and implement a UCP-based replacement policy, labeled WayPart in Sect. 4. We
configure UCP with a 256-line, 16-way associative UMON circuit per core, and use the
Lookahead algorithm to dynamically resize the partitions every 5000 cycles, similar to [14].


Table 2. Simulation configurations of baseline multicore system


Components   Parameters

Processor    4-core, 2.6 GHz, out-of-order, 4-issue width, 168-entry ROB,
             64-entry load queue, 32-entry store queue

L1 Cache     32 KB, split instruction/data cache, 4-way associative,
             64-byte block size, 4-cycle latency, LRU replacement

L2 Cache     256 KB, unified cache, 8-way associative, 64-byte block size,
             8-cycle latency, LRU replacement

L3 Cache     4 MB, unified cache, 16-way associative, 64-byte block size,
             28-cycle latency, LRU replacement

Memory       1 channel, 8 ranks, 8 banks, 64-bit bandwidth,
             1333 MHz bus frequency, 64-entry read queue, 64-entry write queue,
             open page management, 1 KB row buffer,
             tRCD = 15 ns, tCL = 15 ns, tRP = 15 ns, tWTR = 7.5 ns,
             tWR = 15 ns, BL/2 = 4, FR-FCFS request scheduler
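
The greedy idea behind the Lookahead algorithm of UCP [14] can be illustrated with a
minimal sketch. It assumes that a per-application hit curve (hits as a function of
allocated ways) is already available, for example from the UMON counters. The function
name, the min_ways parameter, and the sample curves are illustrative assumptions and are
not the simulator code used in this study.

```python
def lookahead_partition(hit_curves, total_ways, min_ways=1):
    """Greedy Lookahead-style way allocation (illustrative sketch).

    hit_curves[i][w] is the number of hits application i would see with
    w ways (w = 0 .. total_ways), e.g. as estimated by a utility monitor.
    """
    n = len(hit_curves)
    if n == 0 or n * min_ways > total_ways:
        raise ValueError("not enough ways for the requested minimum allocation")

    alloc = [min_ways] * n
    balance = total_ways - n * min_ways  # ways still to hand out

    while balance > 0:
        best = None  # (marginal utility per way, application, extra ways)
        for app, curve in enumerate(hit_curves):
            # Look ahead over every feasible increment, not just one way,
            # so that step-shaped (non-convex) curves are handled.
            for extra in range(1, balance + 1):
                gain = curve[alloc[app] + extra] - curve[alloc[app]]
                candidate = (gain / extra, app, extra)
                if best is None or candidate[0] > best[0]:
                    best = candidate
        _, app, extra = best
        alloc[app] += extra
        balance -= extra
    return alloc


# Hypothetical hit curves for two co-running programs on a 4-way cache.
curves = [
    [0, 10, 20, 30, 40],  # gains steadily with every extra way
    [0, 0, 0, 30, 30],    # gains only once it holds three ways
]
print(lookahead_partition(curves, total_ways=4))  # -> [1, 3]
```

In this example a purely one-way-at-a-time greedy policy would never serve the second
program, because its first extra way yields no hits; looking ahead over larger
increments lets the allocator see the step in its curve.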


3.2 Workloads 


The SPEC CPU2006 [5] suite is used for evaluation. We compile each program
using GCC 6.2.1 with the -O3 optimization flag, and run each with a single reference
input. A representative slice of 500 million instructions for each program is identified
with PinPoints [13]. To characterize memory behavior, we perform both way sensitivity
analysis and memory intensity analysis for all 29 programs.

Fig. 2. Sensitivity of performance in terms of IPC to cache ways


Way Sensitivity Analysis. We observe how performance, in terms of instructions per
cycle (IPC), varies as we change the number of cache ways a program can access. We
classify the programs into four categories. Way insensitive (WI) programs do not
experience performance improvement as the number of cache ways increases. Way sensitive
(WS) programs show a positive correlation between performance and the number of ways.
Way fit (WE) programs also improve as the number of ways increases, but stop improving
beyond some point even if more ways are assigned. Way step (WT) programs improve as the
number of ways increases, but in steps, with steady phases between increases. Figure 2
depicts the performance versus cache ways curves for typical workloads in these
categories.


Memory Access Intensity Analysis. We also study memory access intensity in terms of
misses per kilo-instruction (MPKI) and classify the programs into four categories. High
intensive (HI) programs have an MPKI larger than 30 and thus generate very high memory
pressure. Medium intensive (MI) programs have an MPKI larger than 10 but less than 30.
Low intensive (LI) programs have an MPKI larger than 1 but less than 10. Non-intensive
(NI) programs have an MPKI lower than 1 and thus do not contend for memory bandwidth.
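
The MPKI thresholds above can be written as a short classifier. This is a sketch for
illustration only: the function names are hypothetical, and the handling of values that
fall exactly on a boundary (30, 10, or 1) is an assumption, since the text does not
specify it.

```python
def mpki(llc_misses, instructions):
    """Misses per kilo-instruction (misses per 1000 instructions)."""
    return 1000.0 * llc_misses / instructions


def classify_intensity(mpki_value):
    """Map an MPKI value to the HI / MI / LI / NI categories.

    Assumption: a value exactly on a threshold falls into the lower category.
    """
    if mpki_value > 30:
        return "HI"
    if mpki_value > 10:
        return "MI"
    if mpki_value > 1:
        return "LI"
    return "NI"


# Hypothetical program: 22.5 million LLC misses in a 500-million-instruction slice.
value = mpki(22_500_000, 500_000_000)
print(value, classify_intensity(value))  # 45.0 HI
```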

Combining memory intensity and way sensitivity, we group the 29 programs into 11
subsets, as shown in Table 3. We perform the evaluation with a total of 77 quad-core
workload mixes built from programs in these categories by varying memory intensity and
way sensitivity.


Table 3. Memory characteristics of SPEC CPU2006 in a 4 MB shared LLC


#     Intensity   Sensitivity   Benchmarks

G01   HI          WT            429.mcf, 473.astar
G02   HI          WE            470.lbm
G03   MI          WT            437.leslie3d, 482.sphinx3, 459.GemsFDTD
G04   MI          WS            450.soplex, 471.omnetpp, 483.xalancbmk
G05   MI          WI            462.libquantum, 433.milc
G06                             434.zeusmp, 445.gobmk, 436.cactusADM
G07                             447.dealII
G08                             400.perlbench, 435.gromacs
G09                             458.sjeng, 403.gcc, 444.namd, 481.wrf,
                                465.tonto, 453.povray, 416.gamess
G10                             401.bzip2, 456.hmmer, 464.h264ref
G11                             410.bwaves, 454.calculix


3.3 Simulation Control 


We adopt a simulation control mechanism commonly used in prior research on cache
memories [8, 14, 16]. We warm up the caches with 1 billion instructions, and then
simulate each program in detail for at least 500 million instructions. If a program
finishes early, it continues to run so that it keeps contending for the shared cache with
its co-running programs. We only report performance for the first 500 million
instructions of each program. Performance is measured with weighted speedup [2],
calculated as

$$\text{Weighted Speedup} = \sum_{i} \frac{IPC_i^{shared}}{IPC_i^{alone}}$$

where $IPC_i^{shared}$ and $IPC_i^{alone}$ are the performance of the i-th program when it
co-runs with other programs and when it runs alone, respectively.
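
As a worked illustration of the weighted speedup metric, the sketch below sums, over all
programs in a mix, the ratio of each program's IPC when co-running to its IPC when
running alone. The function name and the IPC values are made up for the example and are
not measurements from this study.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over programs of (IPC when co-running) / (IPC when running alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("both lists must describe the same programs")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Hypothetical quad-core mix: every program loses some IPC when sharing the LLC.
ipc_alone = [2.0, 1.0, 0.5, 4.0]
ipc_shared = [1.0, 1.0, 0.25, 3.0]
print(weighted_speedup(ipc_shared, ipc_alone))  # 0.5 + 1.0 + 0.5 + 0.75 = 2.75
```

A value equal to the number of programs would mean that no program is slowed down by
sharing; dividing the result under one policy by the result under another gives a
normalized comparison between the two policies.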


4 Empirical Studies 


In this section, we report the results of our empirical studies. Due to the page limit,
we present the evaluation for a subset of the 77 workload mixes; the results also apply
to the remaining workloads. Specifically, we use the workload mixes of medium memory
intensity and varying way sensitivity shown in Table 4 to study the impact of cache
configuration, cache way sensitivity, and partitioning variation on performance. We use
the workload mixes of way step sensitivity and varying memory intensity shown in Table 5
to investigate the impact of memory intensity on performance.


Table 4. Workload mixes of medium memory intensity and varying way sensitivity


Category            Workload mixes

4WS                 (437,482,459,437)
4WI                 (450,471,483,462)
3WS + 1WT           (471,483,462,433)
2WS + 1WI + 1WT     (471,483,437,482)
1WS + 1WI + 2WT     (450,462,433,462)
1WS + 2WI + 1WT     (471,437,482,459)

Table 5. Workload mixes of way step sensitivity and varying memory intensity


Category      Workload mixes

1HI + 3LI     (473,459,436,435)
1HI + 3MI     (473,437,434,445)
2HI + 2MI     (429,482,435,400)
2HI + 2LI     (429,482,459,434)
2HI + 2NI     (473,437,482,400)
3HI + 1MI     (429,473,459,434)
3HI + 1LI     (429,473,437,435)
3HI + 1NI     (429,473,445,400)
1HI + 3NI     (429,400,435,400)


4.1 Impact of Cache Configuration 


Cache Set Number. Figure 3 reports the normalized performance across cache sizes from
4 MB to 16 MB, obtained by increasing the number of cache sets while keeping the number
of cache ways fixed. We can see that the set number has a direct impact on the
performance gains of way-partitioning. To our surprise, the baseline LRU outperforms
way-partitioning for most workloads (7 out of 12) with a small cache of 4 MB. As we
increase the set number, way-partitioning gradually improves these workloads. When the
cache size is increased to 16 MB, way-partitioning beats LRU for 10 of the 12 workloads,
including the 7 workloads that perform poorly in the small 4 MB cache; the other 2
workloads perform similarly under partitioning and LRU. Way-partitioning achieves a
normalized performance improvement over LRU of 11.5% (geometric mean) in the large 16 MB
cache.


Fig. 3. Normalized performance of way-partitioning to LRU in LLC of the same way number
but different set numbers


Cache Way Number. We fix the cache size at 4 MB and vary the associativity between
16-way and 32-way. Figure 4 reports the normalized performance of way-partitioning to LRU
with an associativity of 16-way and 32-way. Considering that the workloads used in this
evaluation are picked in terms of cache way sensitivity, we expect to see performance
improvement for workload mixes containing more cache way sensitive programs under a
highly associative cache. However, we hardly observe any performance improvement across
the workload mixes as we increase the cache associativity, only 1% in geometric mean.
Compared with the impact of increasing the cache set number, the performance improvement
from increasing the cache way number is tiny (1% vs. 11.5%).


Fig. 4. Normalized performance of way-partitioning to LRU in LLC of the same size
but different way numbers: (a) 16-way, (b) 32-way

The different impacts of the cache set number and the cache way number can be attributed
to the amount of conflict misses. As the cache set number increases, the address stream
is scattered among more cache sets, which decreases the conflict misses within a cache
set. In contrast, increasing the cache way number decreases capacity misses but does not
prevent conflict misses. Excessive conflict misses diminish the performance gains from
cache partitioning.
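
The relation between cache size, associativity, and set number used in this comparison
can be checked with a few lines of arithmetic. The sketch assumes the 64-byte block size
of Table 2 and a simple modulo set index; real LLCs may hash addresses differently, and
the function names are hypothetical.

```python
MB = 1024 * 1024


def num_sets(cache_bytes, ways, block_bytes=64):
    """Number of sets = cache size / (ways * block size)."""
    sets, remainder = divmod(cache_bytes, ways * block_bytes)
    if remainder:
        raise ValueError("cache size must be a multiple of ways * block size")
    return sets


def set_index(address, sets, block_bytes=64):
    """Set selected by an address under plain modulo indexing (an assumption)."""
    return (address // block_bytes) % sets


print(num_sets(4 * MB, 16))   # 4096 sets: baseline 4 MB, 16-way LLC
print(num_sets(16 * MB, 16))  # 16384 sets: larger cache, same way number
print(num_sets(4 * MB, 32))   # 2048 sets: same size, doubled associativity

# Two addresses that collide in the 4096-set cache fall into different sets
# once the set number grows to 16384.
a, b = 0, 4096 * 64
print(set_index(a, 4096), set_index(b, 4096))    # 0 0
print(set_index(a, 16384), set_index(b, 16384))  # 0 4096
```

Growing the cache by adding sets spreads addresses over four times as many sets, whereas
doubling the associativity at a fixed size halves the set number.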


4.2 Impact of Memory Characteristics 


Memory Access Intensity. We investigate the impact of memory access intensity with
the workload mixes of way step sensitivity and varying memory intensity shown in
Table 5. The normalized performance of cache partitioning to LRU is presented in
Fig. 5. We observe an unexpected performance degradation for almost all of the workloads
with cache partitioning. Compared with LRU, the performance loss under way-partitioning
is 17% in geometric mean and up to 38%. For this group of workload mixes, cache
partitioning clearly does not work. Further, we find a negative correlation between
performance loss and the similarity in memory access intensity of the mixed programs:
performance degradation is highly likely when programs with less similar intensity are
co-scheduled. For example, the workloads labeled 1HI3LI and 3HI1LI both contain programs
of high and low memory intensity. This large diversity in memory access intensity makes
them suffer significant performance loss. In contrast, the performance of the workloads
labeled 3HI1MI and 3MI1HI is hardly affected because they have smaller diversity in
memory access intensity. This negative correlation can be explained by the extra misses
caused by partitioning interference. We compare the cache misses under LRU and
way-partitioning for each program in the 3HI1LI workload mix, as shown in Table 6. With
partitioning, the misses of 429.mcf decrease but the misses of the remaining programs
increase. The misses of 436.cactusADM increase by 20x. Due to partitioning interference,
extra misses are pushed to the co-scheduled partners.

Fig. 5. Normalized performance of way-partitioning to LRU for workload mixes of cache way
step sensitivity programs under different memory intensities


Table 6. Cache misses comparison of workload mixes with highly diverse memory access
intensity under LRU and way-partitioning


Cache misses 429.mcf 436.cactusADM
LRU 45.38 68.27 45.38 5.04
WayPart 48.29 64.60 48.28 102.74


Associative Way Sensitivity. We evaluate the impact of associative way sensitivity
using workload mixes of fixed medium memory intensity and varying way sensitivity.
Figure 3(a) shows the normalized performance of way-partitioning to that of LRU. Firstly,
on average we observe similar performance under LRU and way-partitioning in a 4 MB shared
LLC. Secondly, way-partitioning outperforms LRU for some workloads as a result of
effective use of the allocated ways. For example, for the workload mixes labeled 1WS3WI,
cache partitioning gains an extra 3% to 4% in performance. Thirdly, the workload mixes
that benefit from way-partitioning contain at least one way insensitive program, which is
a necessary but not sufficient condition. Way insensitive programs do not benefit from
the extra lines they receive. For example, 462.libquantum accesses cache lines with a
streaming pattern and does not see any performance gain from receiving more ways.
Consequently, cache partitioning reserves fewer ways for these programs, typically only
1 way, and the remaining ways can be devoted to programs that make heavy use of the
cache. Fourthly, performance is highly dependent on the co-runners when way insensitive
programs are scheduled with others. The workload mix labeled 1WS1WI2WT suffers a
performance loss due to excessive line contention from the way step sensitive programs.





4.3 Impact of Partitioning Variation 


To correlate performance with the programs that benefit from cache partitioning, we
examine the number of cache blocks each program receives dynamically during co-running.
Figure 6 presents the number of blocks allocated by the Lookahead algorithm in UCP
for the workload mix labeled 1WS3WT, consisting of 450.soplex, 437.leslie3d,
482.sphinx3, and 459.GemsFDTD. Firstly, we observe a negative correlation between
partition size variation and cache size. In a 4 MB LLC, the number of allocated
blocks varies frequently, which results in frequent partition adjustment and consequently
increases the overhead of partitioning. As the cache grows to 8 MB and 16 MB by adding
sets, the variation gradually comes down. Secondly, performance improves more as the
partition becomes steadier. This implies that frequent partition variation does not
improve performance, due to the extra overhead of partition adjustment.

Fig. 6. Cache blocks received by each program in a medium memory intensive workload mix
consisting of one way sensitive and three way step sensitive programs under the Lookahead
partitioning algorithm


4.4 Implications 


We summarize our findings and their implications for future cache system design and for
the effective use of cache resources in cloud data centers as follows.


• Cache partitioning does not work well in caches of small capacity. Increasing capacity
by increasing the set number rather than the way number can improve the performance
gains of cache partitioning. Considering the design complexity, high overhead, and
energy cost of associative lookup, it is not beneficial to design highly associative
caches for data centers, because they do not make cache partitioning work better.

• Service consolidation in data centers should take the memory characteristics of
programs into account, since they have a direct impact on the performance gains of cache
partitioning. Performance degrades when cache partitioning is applied to mixes that
combine highly memory-intensive programs with programs of dissimilar memory access
intensity. Co-scheduling programs of heterogeneous cache way sensitivity can make
effective use of cache partitioning.

• The overhead of dynamic partition adjustment reduces the performance gains of cache
partitioning. A partitioning algorithm that keeps partition sizes steady by adjusting
them less frequently can improve performance.


5 Conclusion 


Service consolidation promises to improve the poor resource utilization in cloud data
centers, but at the risk of degraded performance due to uncontrolled access to the shared
last-level cache. Although cache partitioning schemes have been explored extensively, how
to make effective use of cache partitioning is still not well understood. In this paper,
we explore when partitioning works and when it does not, with an empirical study of a
commonly used way-partitioning policy on a variety of workloads. We investigate the
impact of cache configuration, the memory characteristics of programs, and partitioning
variation on the performance gain under partitioning. We identify several interesting
findings that can guide future cache system design and optimization for cloud data
centers.


Abstract. With the rapid development of the economy and people's growing material
needs, railway transportation has increased in frequency and intensity, and railway
maintenance and safety are receiving more and more attention. Routine daily maintenance
is currently done mainly by hand and by large rail inspection vehicles. This method is
labor-intensive, inefficient, and risky, and its accuracy is low. Against this
background, the project team has designed an efficient track inspection machine based on
the collaborative operation of a mother machine and sub-machines, through which railway
maintenance and data collection are carried out. The mother machine detects defects and
collects data, the sub-machines perform repairs and collect data, and the upper (host)
computer implements coordination, big data processing, and system feedback. The collected
data constitute railway big data. To take advantage of these data, the team has set up a
big data processing system based on Hadoop, adopting clustering analysis, integration
analysis, and time prediction analysis to obtain a defect distribution map, so as to
optimize the operation of the mother machine and sub-machines and continuously improve
the performance of the maintenance system built on them. The resulting railway
maintenance system is intelligent and timely. It can perform railway maintenance in an
automated, unmanned manner, improve the efficiency of railway maintenance, and reduce
its cost.


Keywords: Railway maintenance · Zipper · Big data analysis · Intelligent system ·
Feedback optimization


1 Background in the Era of Big Data


In recent years, China’s railways have developed rapidly as China’s economy has continued
to grow. According to data from the Statistics Center of the Ministry of Railways,
China’s railway operating mileage is 91,000 km, annual passenger traffic is up to 167609
million passengers, and a total of 364.27 million tons of cargo is shipped. China’s
railway system has now carried out six large-scale speed-up projects and has introduced
and developed high-speed railways with speeds of 350 km/h or more. Moreover, the safety
requirements on the railway keep rising as the railway continues to improve. Daily track
maintenance mainly consists of detecting and repairing track irregularity, track defects,
loose bolts, and other problems. However, because China’s railways are widely distributed
and include a large number of long tracks, the railway sector mainly relies on manual or
semi-automated machine inspection to detect and repair rail damage. Against this
background, the project team designed an efficient rail maintenance machine based on the
coordinated operation of a mother machine and sub-machines. The machine also makes rail
detection and maintenance automated and intelligent. Most importantly, a big data
processing system can be established on Hadoop, based on the data produced by this
machine's detection and maintenance process.

Although current equipment supports data collection and data feedback, it does not
achieve real-time interaction: the two parts are separated by a certain lag. After
collection, the data cannot be fed back to the data processing system through the host
computer in time, and the data processed by the data processing terminal cannot be sent
back to the working equipment in time, so the requirement of immediate maintenance
cannot be met.

By collecting big data on the railway, processing and integrating it with Hadoop, and
feeding information back through the Hadoop system, the equipment can examine defects
more accurately according to the relevant data. In addition, by integrating information
on the types of defects and the locations where they occur repeatedly, maintenance work
can be targeted, so that specific sections of the railway can be inspected and
efficiency improved.


2 Data Detection and Acquisition 


From the background of railway development, we can see that there are two kinds of
information on existing railway tracks: track irregularity and track surface defects.
Irregularity refers to deviations in the geometry, size, and spatial position of the
track. In the broadest sense, deviations of the track center line and track height on
straight sections, deviations of the curve center line, and deviations in curvature,
superelevation, gauge, and slope are collectively referred to as track irregularity.
Track surface defects include cracks in the rail surface, abrasions, spalling, dark or
black lines on the surface, and so on.





Fig. 1. Track maintenance machine

This information is collected by the mother machine and sub-machines working together.
The track maintenance machine is shown in Fig. 1 [1].

The acquisition process is divided into three parts. First, images, instrument readings,
and other raw information are acquired. Then the directly collected information is
converted into the corresponding digital information. Finally, once conversion is
complete, the digital information is combined with positioning information into a data
packet and sent to the host computer through the FPI bus and the 4G communication
network. The specific detection and acquisition methods are as follows:


2.1 Irregularity Information Detection


To obtain irregularity information, we test each type of track irregularity with a
different instrument, according to its category. Track irregularities can be divided into
vertical irregularities, transverse irregularities, and compound irregularities (Fig. 2).

Fig. 2. Schematic diagrams of the various track irregularities


To obtain the irregularity information above, we use the mother machine for detection
and data collection, with a different data collection method for each type of
irregularity. During operation, all the testing hardware works simultaneously.
Displacement sensors measure the displacement, which is added to the standard gauge to
give the actual gauge value. A gyroscope gives the machine’s roll angle, from which the
superelevation is calculated through a trigonometric relation. The transverse
acceleration sensor measures the lateral acceleration of the vehicle, which is integrated
to obtain the horizontal displacement deviation. The vertical acceleration sensor
measures the vertical acceleration of the vehicle body, which is integrated to obtain the
vertical displacement offset, from which the longitudinal level is calculated [2].
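
The calculations described above can be sketched as follows. This is an illustrative
example under stated assumptions, not the project's firmware: the standard gauge of
1435 mm and the 1500 mm rail spacing used in the trigonometric relation are assumed
values, the function names are hypothetical, and the double integration assumes zero
initial velocity and displacement and ignores sensor drift.

```python
import math

STANDARD_GAUGE_MM = 1435.0  # assumed standard gauge
RAIL_SPACING_MM = 1500.0    # assumed distance between rail head centers


def actual_gauge(displacement_mm, standard_mm=STANDARD_GAUGE_MM):
    """Actual gauge = standard gauge + displacement measured by the sensors."""
    return standard_mm + displacement_mm


def superelevation(roll_angle_rad, rail_spacing_mm=RAIL_SPACING_MM):
    """Height difference between the two rails from the gyroscope roll angle."""
    return rail_spacing_mm * math.sin(roll_angle_rad)


def displacement_from_acceleration(accel, dt):
    """Integrate acceleration samples twice (trapezoidal rule).

    Returns the displacement at each sample time, starting from rest.
    """
    velocity, displacement = 0.0, 0.0
    result = [0.0]
    for a_prev, a_next in zip(accel, accel[1:]):
        next_velocity = velocity + 0.5 * (a_prev + a_next) * dt
        displacement += 0.5 * (velocity + next_velocity) * dt
        velocity = next_velocity
        result.append(displacement)
    return result


print(actual_gauge(2.5))                   # 1437.5 (mm)
print(superelevation(math.radians(1.0)))   # about 26 mm for a 1 degree roll
offsets = displacement_from_acceleration([2.0] * 11, dt=0.1)
print(offsets[-1])                         # about 1.0 after 1 s at 2 m/s^2
```

In practice the integrated signals would need filtering to remove drift before they are
used as displacement deviations.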


2.2 Defect Information Detection 


Track surface defects include cracks in the rail surface, abrasions, spalling, dark or
black lines on the surface, and so on. The causes of defects fall into two categories:
one is limitations of the manufacturing process, which leave defects in the steel rail
during forging; the other is high-intensity, high-density fatigue wear between
locomotives and rails.

We therefore use machine vision technology to detect the defects arising from these
causes. First, a CCD camera acquires images of the track surface. Then an FPGA combined
with a single-chip microcomputer performs feature extraction, and image processing yields
the final defect type and location. The coordinates of the defect can then be determined
from the image pixel information [3].


2.3 Location Information Detection 


Location information is obtained mainly through GPS and an encoder, which provide
absolute geographic positioning and relative positioning, respectively. GPS absolute
positioning mainly gives the approximate position of the machine and of the detected
irregularity and defect information. Encoder-based relative positioning mainly uses the
positioning information from the start of the sub-machine's operation and the direction
of travel along the rail to locate the detection information relative to that starting
point. Finally, the two are combined to determine the location of the defect.

After the above three kinds of information are obtained, they are converted to digital
signals by the FPGA and sent to the host computer through the FPI bus and the 4G
communication network for data collection and further processing.


3 The Detection Data Processing and Analysis 


Data collection is followed by the corresponding data processing. In this project, the
most important steps are the analysis of these data and the data feedback that follows,
which can simplify the operation of maintenance machinery on the railway track and
improve the efficiency of rail maintenance and management.

Railway track data can be divided into two categories: structured data and unstructured
data. Unstructured data refers to the data recorded by the rail maintenance machine in
this project, the most important of which is the data generated by image processing.
Processing this part of the data is the key to the project.


3.1 Big Data Characteristics 


As we collect and maintain data on railway maintenance, we accumulate large data
resources. We need to analyze this big data to obtain the distribution of railway
maintenance work. The big data has the following characteristics:


(1) Large data volume
Railways are widely distributed, cover long distances, and are essential to daily life.
Traditional data-processing software cannot store or process the huge amount of data
collected from railway tracks, so big data technology is required to meet the
project’s requirements.

(2) Diverse data types
As the project goes deeper, the railway situation proves intricate, so the diversity of
data types supported by big data is needed to meet the data processing requirements of
the project's early stage.

(3) Fast data transmission
In this project, the processed data on the railway situation should be sent to the
staff's hand-held equipment so that they can respond quickly to abnormal rail states.

(4) Low value density
Big data processing may have to deal with a majority of insignificant data in order to
finally extract the small part of the data or results that has high value.


3.2 Big Data Processing Technology Based on Hadoop 


The railway track produces a large amount of data, so the transmission and storage of
these data are key problems to be solved. For transmission, data compression can
effectively reduce the amount of data sent over the network and improve storage
efficiency. We use a lifting-scheme-based compression and reconstruction algorithm for
real-time transient signal data, which combines an integer-transform biorthogonal wavelet
filter with Huffman coding to compress and decompress the real-time track detection data.
The data must be decompressed after arriving at the monitoring center, which therefore
needs an appropriate computing and storage platform. For storage, because the real-time
requirements on the track data are not very high, the detected data can be stored using
Hadoop’s HDFS storage system. With Hadoop's big data technology handling the data
processing, we can continuously optimize the railway track detection system through the
data feedback mechanism. For example, after long-term detection we can determine how
frequently defects occur at different locations, and then set a different detection
frequency for each location to achieve efficient detection and management.


4 The Data Analysis and Optimization 


After the mother machine and sub-machines have worked together for some time, they
inevitably produce a large amount of railway maintenance data. Analyzing this massive
data and optimizing on it influences how the work on the railway is distributed and
proportioned. Therefore, we detect and collect data, process it, and analyze it to obtain
a big-data-based railway maintenance profile, that is, the “track spectrum.” Through data
analysis of the track spectrum, we find the frequency distribution of daily maintenance
work, so as to optimize the scheduling of the cooperating sub-machines, apply the 80/20
principle in daily railway maintenance, improve the efficiency of overhaul, and reduce
the cost of maintenance.
The data monitoring and analysis platform is shown in Fig. 3.

Fig. 3. Monitoring platform


To analyze the data, we use two methods on top of Hadoop, clustering analysis and
integration analysis, and carry out a comprehensive evaluation of the data based on the
two types of data and the state of track maintenance. We analyze the actual maintenance
state and forecast its evolution, and adjust the proportion of sub-machines assigned to
maintenance and the distribution of their work scope accordingly. The two analytical
methods are briefly introduced below:


4.1 Clustering Analysis of Prediction Strength


The clustering analysis is divided into two parts: the test set and the training set. The
detailed analysis is shown in Fig. 4.

Through the prediction strength, and based on Hadoop, we can obtain the predicted
distribution of railway track defects, irregularities, and loose bolts. We can then set
repair and allocation rules according to how frequently defects occur in different
regions, further apply the 80/20 principle, improve the efficiency of daily railway
maintenance, and reduce its cost.
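
One way to read the 80/20 principle in this setting is to find the small set of track
sections that accounts for most of the recorded defects, so that maintenance effort can
be concentrated there. The sketch below is only an illustration: the function name, the
section identifiers, and the defect counts are made up, and it is not the project's
Hadoop job.

```python
def pareto_sections(defect_counts, share_percent=80):
    """Return the sections that together cover at least share_percent of defects.

    Sections are taken in order of decreasing defect count.
    """
    total = sum(defect_counts.values())
    if total == 0:
        return []
    ranked = sorted(defect_counts.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0
    for section, count in ranked:
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= share_percent * total:
            break
        selected.append(section)
        covered += count
    return selected


# Hypothetical defect counts per track section.
counts = {"K10": 50, "K11": 30, "K12": 10, "K13": 5, "K14": 5}
print(pareto_sections(counts))  # ['K10', 'K11'] cover 80 of 100 defects
```

The selected sections are natural candidates for a higher detection frequency or for
assigning more sub-machines.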


4.2 Integration Analysis 


Integration analysis is also an effective way to solve the “big p, small n” problem, so
it is necessary to study the integration and analysis of different data sets in the era
of big data [4].

Because clustering analysis treats the data in a scattered way, its analysis is not
coherent and its overall picture is not comprehensive. We therefore combine it with
integration analysis to analyze the big data of railway track maintenance and repair,
achieving a complete big data analysis along the whole track.


4.3 Comprehensive Analysis 


Based on the results of the above two analyses, we can obtain the comprehensive
distribution of railway track defects, irregularities, loose bolts, and other
information, and predict the rate and time of their occurrence on the rails. We can then
plan maintenance of different intensity and frequency for different regions, improve the
efficiency of the maintenance sub-machines, and reduce the corresponding cost, so that
railway track maintenance operations truly follow the “80/20 principle.” The maintenance
machine feedback workflow is shown in Fig. 5.


5 Summary


Today, railway maintenance still relies mainly on manual labour, which is
time-consuming and difficult. Current machinery is automated but not intelligent, and
cannot meet today’s needs.

This project designed a highly efficient intelligent rail maintenance machine and, on
this basis, a Hadoop-based big data processing system that collects, processes and
analyzes the massive data detected by the maintenance machine and feeds the results
back to it. The project has the following innovations:


(1) The mechanical structure is highly adaptive and can carry out detection and
maintenance operations automatically

(2) The modular, integrated, systematic design can be adapted to a variety of complex
operating requirements

(3) A two-wire transmission function for data and instructions provides online
detection and status monitoring of the machine, and real-time access to changes in
the state of the rail

(4) An online data analysis and big data processing platform analyzes the state of
railway maintenance and feeds back into the working mechanism of the sub-machine,
reducing costs and improving efficiency


The project team uses cloud computing as a platform for heterogeneous and diversified
data storage and analysis. Once the platform is in operation, the massive data support
condition-based maintenance, system feedback optimization and the interoperability of
isolated information systems, and make the system a candidate for later integration.
This work offers low cost, good system scalability (unlimited storage capacity), high
reliability and parallel analysis, and could become an important system for intelligent
railway maintenance in the future.


References 


1. Chen, C., Kong, J., et al.: Modern Mechanical Designer Manual. Mechanical Industry Press, 
Beijing (2014) 

2. Wang, Y., Yu, Z., Bia, B., Xu, X., Zhu, L.: Study on crack identification algorithm of metro 
tunnel based on image processing. J. Instrum. 07, 1489-1496 (2014) 

3. Wang, Y.: Study on key technology of big data processing flow based on Hadoop. Inf. Technol. 
09, 143-146, 151 (2014) 

4. Ma, S., Wang, X., Fang, K.: Integration analysis of big data. J. Stat. Res. 11, 3-11 (2015) 

5. Cao, Y.: Hadoop Performance Optimization in Big Data Environment. Dalian Maritime 
University (2013) 

6. Tang, D.: Hadoop-based affine propagation big data clustering analysis method. Comput. Eng. 
Appl. 04, 29-34 (2015) 


Crowdsourcing and Stigmergic Approaches 
for (Swarm) Intelligent Transportation Systems 


Salvatore Distefano 1,2 (✉), Giovanni Merlino 1, Antonio Puliafito 1,
Davide Cerotti 3, and Rustem Dautov 2


1 Università degli Studi di Messina, Messina, Italy
{sdistefano,gmerlino,apuliafito}@unime.it 
2 Social and Urban Computing Group, Kazan Federal University, Kazan, Russia 
{s_distefano,rdautov}@it.kfu.ru 
3 Politecnico di Milano, Milano, Italy 
davide.cerotti@polimi.it 


Abstract. In the last decades, Information and Communication Technologies
(ICT) have radically changed transportation systems, establishing Intelligent
Transportation Systems (ITS) as a new research area. A problem often addressed
in ITS is vehicle routing, for which plenty of solutions have already been
defined in the literature. Vehicle routing problems are usually NP-hard, so
these solutions are mainly heuristic. They are required to be deployed and run
in navigation systems, ready to react to sudden changes in (quasi) real time.
Hence, reducing latency is still an open issue, which depends not only on the
complexity of the solution but also on other parameters, such as the traffic
update latency in traffic-aware vehicle routing. One way to address it is to
exploit distributed, collaborative approaches, establishing a proper
collaboration platform and algorithms able to use it. Mobile Crowdsensing, on
the one hand, and collective and swarm intelligence approaches, on the other,
can fill this gap. This paper is a first attempt in this direction, aiming at
defining a new class of Swarm Intelligent Transportation Systems (SITS) on top
of a crowdsourcing-based infrastructure.














Keywords: MANETs - Mobile crowdsensing - Stigmergy 
1 Introduction and Motivations





With an ever-growing availability of embedded, mostly personal and mobile
computing devices for everyday tasks, there is an almost limitless potential for
tapping onboard resources, especially sensing-related ones, as well as the
corresponding compute nodes to be exploited for locally executable tasks. Mobile
CrowdSensing (MCS) comprises by definition a category of applications where
individuals carrying sensor-hosting embedded computers (e.g. smartphones) get
collectively engaged in information gathering and sharing efforts in order to
analyze and georeference events which may be interesting for individuals and
communities alike. MCS is already establishing itself as a trendy paradigm, but
most efforts go in the direction of easing participatory (e.g. manned) patterns.
Apart from privacy and security issues, where anonymization and sandboxing
respectively are key countermeasures, most engagement chances should be
opportunistic if MCS is to be truly widespread, inexpensive and wholly disruptive
as a paradigm. In this context the problems lie foremost in enabling unassisted
deployments, as well as in accommodating peer-oriented communication and
distributed self-organization, mostly due to real-world constraints, e.g.
intermittently (WAN-)disconnected operation.

One of the main advantages of MCS is the possibility to conduct sample
collection, data mining, etc., without planning the corresponding experiments
in advance, just leveraging natural daily life patterns arising from human
activities as they happen and leave behind breadcrumbs in the form of samples
ready to be collected. The aim of this kind of enablement is then to put this
power at the fingertips of developers or would-be entrepreneurs, ready to
kickstart any effort in next to no time. In particular, self-provisioning and
autonomous cooperation are needed to avoid long setup times for experiments and
disruptions beyond careful planning and sizing, as well as to help coders
develop less custom logic.

Most existing typical MCS applications currently feature a common, simpli- 
fied architecture, made up of two main components, one running on the embed- 
ded device in order to collect and disseminate measurements, and a second one 
as backend hosted on e.g. the Cloud for data mining, analytics and other busi- 
ness intelligence according to the requirements of the application at hand. Each 
application gets designed mostly from scratch with no common ground despite 
every implementation tackling, of necessity, overlapping challenges in
scheduling, sampling, and communication duties, among others. A few drawbacks of
such a siloed pattern deserve to be pointed out as severe hindrances despite the
promise of the underlying paradigm:














— wasted development efforts, due to mostly ground-up coding every time, 
including OS and platform-dependent adaptations 

— unoptimized runtime, as multiple applications would execute on the same 
nodes without taking into account such configuration, possibly duplicating 
sensing or processing activities on resource-constrained devices, thus also lim- 
iting scalability of the platform 

— no exploitation of proximity or density in topologies, by e.g. cooperation 
across nodes. 





In particular this last point is crucial, as any kind of high-density scenario,
especially one with real-time constraints, e.g. Intelligent Transportation Systems
(ITS) as we will see in the following, needs a smart approach to proactively take
advantage of proximal nodes and crowded areas instead of crumbling under the
weight of such scale. Indeed, such a strategy could translate either into
(self-)throttling redundant devices or even into letting node aggregates
preprocess data and shape traffic accordingly, e.g. by network coding strategies;
otherwise the backend (and the network itself) is left prone to scalability issues
in scenarios with huge populations.

Given the paradigm, i.e. MCS, and forthcoming use cases, with specific regard
to ITS where mobility is really going to match crowds at scale, we conceived a
design pattern. Such a scheme may lend resilience when faced with big swarms,
while at the same time helping DevOps with their coding and deployment
strategies. More specifically, collaborative, collective and swarm intelligence
approaches could be good metaheuristics that can provide effective solutions to
ITS problems such as vehicle routing, which is usually NP-hard. The collaboration
of multiple agents is the best solution available in the ITS scenario to reduce
latency and to address the routing problem effectively also when traffic
conditions must be considered. This perfectly matches crowdsourcing-based
paradigms such as MCS, which mainly aim at supporting applications able to
exploit collective intelligence and its emerging properties in problem solving.
In this paper, we propose to address the routing problem by adopting a swarm
intelligence approach able to take into account the road traffic conditions. It
mainly consists of an ant colony optimization (ACO) algorithm implemented and
deployed on mobile devices constituting an MCS-contributed infrastructure, able
to interact with each other following opportunistic patterns. We therefore
framed our approach in the class of Swarm ITS (SITS).

In the following, we first lay out an overview of ITS and then discuss MCS
solutions for them. After that, we focus on swarm intelligence, stigmergy and
ACO, coming up with SITS. We then define our traffic-aware vehicle routing
solution based on a modified version of ACO. This SITS solution is evaluated by
a simulator, whose preliminary results are discussed. Finally, some remarks and
hints for future work close the paper.





2 Intelligent Transportation Systems 


Intelligent transportation systems (ITS) are the coherent combination of
advanced systems and services which aim at providing as a whole innovative
solutions related to typically metropolitan and regional mobility based on mul- 
tiple modes of transport, by means of traffic management, as well as enabling 
users belonging to whichever category to be up-to-date, thoroughly informed 
and aware of any (transient or structural) issues, in order to make safer, more 
coordinated, and ‘smarter’ use of transport networks. A directive by the Euro- 
pean Union Commission defines ITS [1] as “systems in which information and 
communication technologies are applied in the field of road transport, includ- 
ing infrastructure, vehicles and users, and in traffic management and mobility 
management, as well as for interfaces with other modes of transport”. 

From the ICT perspective, we may envision ITS embracing any advanced
solution for transport engineering that integrates live data and other feedback
from a number of heterogeneous sources, such as parking guidance and
information systems. In particular, efforts related to ITS seem naturally poised
to target high-population-density areas, considering the consistent orientation
of such networks towards multimodal systems of transportation, be those
personal vehicles or shared vectors such as buses or trains.

ITS naturally spans a wide range of technologies, in particular ICT ones, 
starting from basic management systems such as navigation ones, possibly to 
be augmented in the future by systems where artificial “co-drivers” may assist 
humans during their duties [2]. Yet, there are many other examples of instances 
of subsystems prone to be enhanced through ICT, e.g., traffic signal control 
systems, which may leverage some kind of system-optimal routing algorithm for 
road networks as well, such as game-theory based ones [3]. Moreover, from a 
technological viewpoint, any delay in information dissemination for vehicle-to- 
vehicle communication networks [4], so called VANETs [5], considering a traffic- 
dense configuration as the relevant scenario, can be identified as one of the 
main challenges to be overcome for any coordination system to really work as 
expected. Some authors [6] have leveraged Deep Learning to predict traffic flows 
by dealing with Big Data sources. Such problems were also analyzed by model- 
based solutions: for instance in [7] a stochastic (hazard-based) model to evaluate 
the impact of a reliability-safety tradeoff on the travel-time is proposed. 





3 A Crowdsourcing Infrastructure for ITS 


Typical MCS applications mainly implement a client-server interaction pattern 
where a service provider offers MCS-based services to end users, leveraging con- 
tributor willingness to provide their physical (sensing) resources [8]. Data are 
therefore collected and processed by (backend and frontend) servers to carry out 
aggregate analytics and feed back relevant results to end users. 

Starting from the lowest level, through heuristics and algorithm design, local 
analytics may provide a category of functions, among which simple ones are 
interpolation, extrapolation and outlier filtering, which may enhance a standard 
MCS application. This aspect is summarized and depicted in Fig. 1, where the 
main differences against the traditional MCS approach are highlighted in red. 
Indeed, the differentiation is related to an opportunistic, collaborative approach, 
where nodes may interact with one another to aid local computations and perform
distributed optimizations on a small/medium scale. This way, end users may 
leverage an MCS application, server interaction issues notwithstanding, by just 
exploiting cooperation among nodes. 

These designs may be quite application-specific, e.g., different crowdsensing 
applications would coexist, each bringing independently operating local analyt- 
ics, yet still possibly accessing the same readings or applying comparable infer- 
ence strategies. Anyway, local analytics provide data about a relatively confined 
area. There are applications with a different set of requirements, where some 
kind of aggregate analytics is needed, to be run at the backend. The main task 
in this case is extracting patterns from huge sets of sensor-provided data, origi- 
nating from large populations of mobile devices. Patterns may highlight features 
of certain physical (or social) environments of interest, also helping in building 
models about observed phenomena, which is a way to improve forecasting.

Fig. 1. The MCS reference scenario. (Color figure online)


Identifying patterns from large datasets means resorting to data mining, 
which calls for one of two approaches, according to either the size of incoming 
data or the limits imposed by applications on delays. In the first case
measurements are first stored in a database, so that mining algorithms can be
applied against whole datasets to detect patterns. When the input stream is continuous and
overwhelming in terms of storage, or even when applications would require fast 
pattern recognition techniques, “streaming” algorithms for mining may be the 
only viable solutions to identify patterns from streams in flight, independently 
from subsequent treatment such as long-term storage strategies. Data mining 
algorithms usually require domain-specific expertise. 
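
As an illustration of the streaming case, the sketch below filters outliers from a stream of readings while keeping only constant-size state. It uses Welford's running mean and variance with a z-score test; this is one common technique chosen for the example, not the method used in this paper, and the class name, threshold and warm-up length are assumptions.

```python
import math

class StreamingOutlierFilter:
    """Running mean/variance (Welford) with a z-score test; state is O(1)."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0              # number of accepted samples
        self.mean = 0.0         # running mean of accepted samples
        self.m2 = 0.0           # running sum of squared deviations
        self.threshold = threshold
        self.warmup = warmup    # samples accepted unconditionally at start

    def accept(self, x):
        """Return True if x is kept; only kept samples update the statistics."""
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                return False
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return True

# Hypothetical sensor readings with one spurious spike.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 50.0, 10.0]
flt = StreamingOutlierFilter()
kept = [x for x in readings if flt.accept(x)]
```

Each reading is inspected once and then discarded, so the filter can run on a resource-constrained device or on a stream in flight without storing the whole dataset.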





4 Swarm Intelligent Transportation Systems 


Over such a set of (dynamic) meshes, we propose a stigmergic approach for 
cooperation and optimization. Let’s first tackle swarm optimization alone. 


Ant Colony Optimization. Ant colony optimization (ACO) [9] is a relatively
recent metaheuristic based on the behavior of ants seeking a path between their
colony and a source of food. In nature, wandering ants have exhibited in this
sense a provable capability to discover optimal paths. The collective intelligence
of the swarm relies on ants exchanging information indirectly via the
environment (the so-called stigmergy). While traveling to search for food, ants
lay down pheromones on their way back to the nest (i.e. home colony) only when
sources of food are found. As other colony members step into pheromone trails,
they tend to stick to the beaten path accordingly. Moreover, the trace gets
reinforced as individuals follow the same trail, leaving pheromone of their own,
and in turn becomes increasingly attractive for other ants. For any complex
problem which can be reduced to a search for optimal paths, ACO may work as a probabilistic
solver, by emulating such naturally occurring behavior. The probability $p_{ij}^{k}$ for
an artificial ant $k$, placed in vertex $i$, to move toward node $j$ is defined as follows:

$$p_{ij}^{k} = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l \in N_i^k} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}} \qquad (1)$$

where $\tau_{ij}$ corresponds to the quantity of pheromones laid over arc $a_{ij}$, $\eta_{ij}$ to
the a-priori attractiveness of the move, computed by some heuristic embedding the
cost of choosing arc $a_{ij}$ along the path that leads to the destination, and $N_i^k$ is
the set of neighbours of node $i$ for ant $k$, i.e., the nodes available for the ant to
move to. Coefficients $\alpha$ and $\beta$ are global parameters for the algorithm. According
to typical ACO variants, ants bring food back home (i.e. to the nest) after being done
with their movement. Denoting $T^k$ as the tour of ant $k$, $C^k$ is defined as the
length of $T^k$, and used to specify the amount of pheromones to be placed by ant
$k$ on each arc on the trail that led to the food source:

$$\Delta\tau_{ij}^{k} = \begin{cases} 1/C^{k} & \text{if arc } (i,j) \text{ belongs to } T^{k} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

$$\tau_{ij} = \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k} \qquad (3)$$

At the end of a round, after each ant has completed a move, the amount of
pheromone laid over each arc gets reduced (i.e. evaporates), according to:

$$\tau_{ij} = (1 - \rho)\,\tau_{ij} \qquad (4)$$

where $\rho$ is a global parameter as well. Such a simplified form of ACO is called
an “ant system” (AS). In Fig. 2 the pseudocode for an AS of size n is listed.

ACO algorithms yield their best performance when some form of local search 
algorithm is employed. 
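
The three building blocks of the ant system described by Eqs. (1)-(4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pheromone and attractiveness values are stored in dictionaries keyed by arc `(i, j)`, the deposit is taken as `1 / C_k` per arc of the tour, and all function names and the toy values are hypothetical.

```python
def transition_probs(i, neighbors, tau, eta, alpha, beta):
    """Eq. (1): probability of moving from i to each admissible neighbor j."""
    scores = {j: (tau[i, j] ** alpha) * (eta[i, j] ** beta) for j in neighbors}
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}

def deposit(tau, tours, lengths):
    """Eqs. (2)-(3): each ant k adds 1/C_k on every arc of its tour T_k."""
    for tour, c_k in zip(tours, lengths):
        for arc in zip(tour, tour[1:]):
            tau[arc] += 1.0 / c_k

def evaporate(tau, rho):
    """Eq. (4): scale every pheromone value by (1 - rho)."""
    for arc in tau:
        tau[arc] *= (1.0 - rho)

# Toy graph with three arcs (illustrative values only).
tau = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
eta = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 1.0}
probs = transition_probs(0, [1, 2], tau, eta, alpha=1.0, beta=1.0)
deposit(tau, tours=[[0, 2]], lengths=[2.0])
evaporate(tau, rho=0.5)
```

With equal pheromone on both outgoing arcs, the move from node 0 is driven by the a-priori attractiveness alone; after one deposit and one evaporation step the arc used by the ant holds more pheromone than the others.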


Modified ACO for Traffic-Aware Route Planning. To adapt ACOs to 
MCS applications, we propose here a novel version of an ACO, MoCSACO, 
extending the algorithm of Fig. 2 into the one of Fig. 3, where an ant corresponds 
to a (physical) mobile device, i.e., a real-world agent. 

generate initial pheromone matrix P with respect to the graph topology
generation ← 0
while termination criteria not met do
    place m ants randomly on graph vertices; set the amount of collected food for
    each ant to 0
    foreach ant do
        move forward in the graph, following the probabilistic rule from Eq. (1)
        compute the amount of collected food corresponding to the ant's trail
    end foreach
    find the ant with the largest (smallest) amount of collected food (traversed
    arcs); let the ant lay pheromones in P on its trail (back) according to Eq. (3)
    evaporate pheromones in P according to Eq. (4)
    generation ← generation + 1
end while


Fig. 2. The behavior of ant system in pseudocode.

Fig. 3. The behavior of modified ant system in pseudocode.

We are also redefining the general objective of finding the shortest path
on a (weighted) graph in terms of reusing common, state-of-the-art and readily
available heuristics for path discovery. The A* search algorithm is such a solution,
and leaves us free to confine stigmergy to weighting only, e.g. the admissible
heuristic function in case of A*, where each arc has a cost (e.g., weight) defined
by a certain metric. In its turn, weight directly correlates to the quantity of
pheromones, as in:

$$w_{ij} = \gamma\,\tau_{ij} \qquad (5)$$

where $w_{ij}$ is the weight of the arc, $\gamma$ is a constant of proportionality and $\tau_{ij}$
represents the amount of pheromones placed on arc $a_{ij}$.

In order to make the a-priori cost (i.e., of choosing an arc along the path
towards destination) explicit, we first define:

$$c_{ij}^{d} = c_{ij} + c_{jd} \qquad (6)$$

the distance (i.e., cost) $c_{ij}^{d}$ from node $i$ towards destination $d$ through node $j$ as
the sum of that between $i$ and $j$, $c_{ij}$, and that from $j$ to destination, $c_{jd}$. Then
we specify the cost over a certain arc $a_{ij}$:

$$c_{ij} = h_{ij} / w_{ij} \qquad (7)$$

as the ratio between a certain property we want to use as metric, $h_{ij}$, and its
weight, $w_{ij}$.

Given all the above, the value of the a-priori gain, $\eta_{ij}^{d}$, for a certain choice
leading to destination $d$ is computed according to the following formula:

$$\eta_{ij}^{d} = \frac{\theta}{c_{ij}^{d}} \qquad (8)$$

where the relationship is inversely proportional with respect to the (weighted)
distance, and $\theta$ is just a constant of proportionality.
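
The chain from pheromone to a-priori gain in Eqs. (5)-(8) can be traced with a small numeric sketch. It assumes the reading in which the weight is proportional to the pheromone (Eq. 5) and the gain is a constant divided by the cost towards the destination (Eq. 8); the function names and the sample values are hypothetical.

```python
def arc_weight(tau_ij, gamma=1.0):
    """Eq. (5), as read here: weight proportional to the pheromone on the arc."""
    return gamma * tau_ij

def arc_cost(h_ij, w_ij):
    """Eq. (7): metric of the arc divided by its weight."""
    return h_ij / w_ij

def a_priori_gain(c_ij, c_jd, theta=1.0):
    """Eqs. (6) and (8): gain inversely proportional to the cost via j to d."""
    c_ijd = c_ij + c_jd          # Eq. (6)
    return theta / c_ijd         # Eq. (8), as read here

# Illustrative values: pheromone 2.0 on the arc, metric (e.g. length) 4.0,
# and a remaining cost of 6.0 from j to the destination.
w = arc_weight(2.0)
c = arc_cost(4.0, w)
gain = a_priori_gain(c, 6.0)
```

Under these assumptions a larger pheromone value lowers the arc cost and therefore raises the a-priori gain of choosing that arc.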

Following this pattern, one more fix, also applicable to the standard ACO
variant, would consist in relaxing the requirement that agents travel back home
after finding food, leveraging instead the communication bus for near-instant
swarm-wide dissemination of pheromone trails.

Thus, a newly defined probability $p_{ij}^{d}$ for any artificial ant, placed in vertex $i$,
to move toward node $j$, according to destination $d$, is defined as follows:

$$p_{ij}^{d} = \frac{\tau_{ij}^{\alpha}\,(\eta_{ij}^{d})^{\beta}}{\sum_{l \in N_i^k} \tau_{il}^{\alpha}\,(\eta_{il}^{d})^{\beta}} \qquad (9)$$

where $\tau_{ij}$ corresponds to the quantity of pheromones laid over arc $a_{ij}$, $\eta_{ij}^{d}$ to
the a-priori attractiveness of the choice, computed by some heuristic embedding the
cost of choosing arc $a_{ij}$ along the path that leads to the destination, and $N_i^k$
is the set of neighbors of node $i$ for ant $k$, i.e., the admissible transitions for
the ant.

The amount of pheromone to be deposited depends on a metric for posterior
costs, in general with no relation to $h$ as defined in Eq. 7. Even in this case,
pheromone gets updated as defined in Eq. 3.

As choices are unpredictable and there cannot be a notion of rounds for
real-world agents, pheromone laid over each arc evaporates, still according to
Eq. 4, but on a time basis, by setting timers.
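
One possible way to apply Eq. 4 on a time basis is to treat each expired timer period as one evaporation step. The sketch below is an assumption about how such timers could be realized, not the scheme used in the simulator; the function name and the `period` parameter are hypothetical.

```python
def evaporate_over_time(tau_ij, rho, elapsed, period):
    """Apply Eq. (4) once for every full timer period contained in `elapsed`."""
    ticks = int(elapsed // period)
    return tau_ij * (1.0 - rho) ** ticks

# Illustrative values: pheromone 8.0, rho 0.5, 25 s elapsed, 10 s timer period,
# so two full periods have expired.
remaining = evaporate_over_time(8.0, rho=0.5, elapsed=25.0, period=10.0)
```

Computing the decay from the elapsed time, rather than per round, lets each device update an arc lazily when it next reads or writes the pheromone value.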
Even in this case, far from losing generality, the modification expands the scope
of applicability of the solution, by relaxing constraints over the scenario.


5 Preliminary Evaluation 


To evaluate our proposed technique, we examined the traffic of an urban area
near the downtown of the city of Messina. Arc weights represent the length $l_{ij}$
of each road segment so that, using Eqs. 6 and 7, we can compute for each
destination $d$ the cost of taking the arc $(i, j)$ in the path towards node $d$
without considering the road traffic. Providing these values as a metric to define
the heuristic function of the A* search algorithm, we can find the shortest path
length, which can be very different from the optimal path when the traffic is
considered. We then implemented a simulator of the overall system. The
evaluation of the model will provide a different metric to the heuristic function
of the A* search algorithm that also takes the traffic into account.
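
The difference between a length-only route and a route computed on re-weighted arcs can be shown with a compact A* sketch. It is a generic textbook A* on a toy graph, not the simulator used here: the heuristic defaults to zero (which is admissible and reduces the search to uniform-cost search), arc costs follow Eq. 7 with a default weight of 1, and the graph, weights and helper names are made up for illustration.

```python
import heapq

def arc_costs(lengths, weights):
    """Eq. (7): cost of each arc is its metric h_ij divided by its weight w_ij."""
    graph = {}
    for (i, j), h_ij in lengths.items():
        graph.setdefault(i, {})[j] = h_ij / weights.get((i, j), 1.0)
    return graph

def a_star(graph, source, target, heuristic=lambda n: 0.0):
    """graph: {node: {neighbor: cost}}; returns (path, cost) or (None, inf)."""
    frontier = [(heuristic(source), 0.0, source, [source])]
    best = {source: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == target:
            return path, g
        if g > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, {}).items():
            new_g = g + cost
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                heapq.heappush(
                    frontier, (new_g + heuristic(nxt), new_g, nxt, path + [nxt])
                )
    return None, float("inf")

# Hypothetical road segment lengths.
lengths = {(1, 2): 1.0, (1, 3): 4.0, (2, 3): 1.0}
by_length = a_star(arc_costs(lengths, {}), 1, 3)
# A low weight on arc (1, 2) makes it four times as costly.
reweighted = a_star(arc_costs(lengths, {(1, 2): 0.25}), 1, 3)
```

On pure lengths the route goes through node 2; once arc (1, 2) is penalized through its weight, the direct arc becomes the cheaper choice, which is the kind of divergence between shortest and optimal path mentioned above.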

In the evaluation through the simulator we considered two scenarios, in which
we either take the pheromone value into account or not. According to Eq. 9 this
can be obtained by setting either $\alpha = -1$ or $\alpha = 0$, respectively. According
to the results thus obtained, in both cases the flows of traffic concentrate on the
neighbors of the specific destination and are directed towards it, thus confirming
that the messages go in the right directions. In the case not taking pheromones
into account we can observe a strong flow from node 1 to node 3, which is
justified by the presence of a source in node 1. Moreover, the preferred paths used
to reach node 3 are clearly visible. However, as expected, the effect of the
pheromone is to redistribute the traffic: the flow on congested arcs decreases,
while that on underexploited arcs increases.

Fig. 4. Frequency distribution of the aggregated traffic obtained by simulation of the
solutions with (Pher) and without pheromone (No Pher).
To evaluate the effectiveness of the proposed approach in distributing the
overall traffic on the graph, we investigated the distribution of the aggregated
traffic (i.e. related to all destinations) for all the arcs. Since the traffic is
evaluated as the rate of an exponential process, we can compute the aggregated
one simply as the sum of the traffic rates of each destination.
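
The aggregation step relies on the fact that the superposition of independent Poisson streams (exponential inter-arrival times) is again a Poisson stream whose rate is the sum of the individual rates. A minimal sketch, with hypothetical per-destination rates and a hypothetical helper name:

```python
from collections import defaultdict

def aggregate_rates(per_destination):
    """Sum per-destination arc rates; valid for independent Poisson streams."""
    total = defaultdict(float)
    for rates in per_destination.values():
        for arc, rate in rates.items():
            total[arc] += rate
    return dict(total)

# Hypothetical traffic rates on arcs, keyed by destination node.
flows = {
    3: {(1, 2): 0.25, (2, 3): 0.25},
    4: {(1, 2): 0.5, (2, 4): 0.5},
}
aggregated = aggregate_rates(flows)
```

The resulting per-arc values are the quantities whose frequency distribution is then examined across all arcs.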

The resulting frequency distributions (probability mass functions) of the
traffic with and without pheromone are shown in Fig. 4. It can be observed that
the pheromone improves the distribution of the overall traffic intensities. Indeed,
without pheromone, several arcs present high-intensity traffic, especially around
the values 0.2 and 0.55. The distribution thus obtained appears to be bimodal. When
the pheromone is considered, the traffic is more evenly distributed, with a
peak around the value 0.6. From these values, we can argue that the pheromone
allows the overall traffic to be more evenly distributed than when the pheromone
is not considered.





6 Conclusions 


People, crowds and critical masses are becoming more and more powerful, not 
only from an abstract point of view related to the opinion they express, but 
also in more physical terms, due to their work potential. Indeed, volunteer and 
crowd-based approaches are spreading like wildfire across different disciplines
and sciences. Examples are crowdsourcing, crowdfunding, geocomputing and
volunteered geographic information, just to name a few in different areas. A very
fertile ground for new approaches and technologies is computer science and
engineering, where volunteer and crowd-based ones have gained solid roots, as in
the cases of crowdsourcing, crowdsearching and crowdsensing. In particular, Mobile
Crowdsensing is a very promising approach for involving people, citizens and
crowds in sensing campaigns according to participatory and/or opportunistic
schemes. Although the Mobile Crowdsensing paradigm is quickly raising interest
and funding, there is still untapped potential, as well as unexplored solutions
this paradigm may empower.

This paper tries to partially fill this gap by first defining a scenario and identi- 
fying some specific features for a novel opportunistic, distributed, self-organizing 
approach, applied to a specific class of MCS application, tackling local optimiza- 
tion problems, in the ITS domain. The solution proposed adapts and extends 
an ant colony optimization algorithm to a problem of pathfinding and graph 
traversal according to a given distance metric. This way a new class of
intelligent transportation systems is identified: Swarm Intelligent
Transportation Systems (SITS).


Abstract. Objective: To study some problems of visual feedback concerning user
cognition and satisfaction in virtual and real interaction with inconsistent input and
output. Method: Based on natural gestures, experiments were designed to study
users’ perception of, and satisfaction with, different forms of visual feedback in
virtual grasping, with the Ergo LAB physiological auxiliary testing equipment.
Conclusion: In the visual feedback of virtual and real interaction, feedback in the
form of visual expression is superior to feedback in the form of movement, and
color is the main form. Feedback in the form of movement produces stronger
physiological stimulation; deformation is the feedback form closest to human
cognition and can increase the sense of immersion. In interactions involving a
variety of visual feedback forms, it is suggested that priority be given to grouping
feedback of the same form of expression.


Keywords: Virtual grasping - Virtual and real interaction - Visual feedback 
Variable analysis - Physiological index 


1 Introduction 


Virtual and real interaction is a new and emerging form of human-computer interaction,
exemplified by virtual reality and augmented reality. With the help of the powerful
computing and graphics capabilities of computers, it achieves a more intelligent
understanding of human commands and enables more input modes [1].

Various input modes expand the interaction between the virtual and real worlds. In
virtual and real interaction, the most reasonable and efficient form of interaction
should be natural gestures, which allow users to abandon external devices and interact
directly with the virtual scene.

There are many forms of feedback during virtual and real interaction, such as vision,
hearing, touch and space conversion. In virtual and real interaction, the content is mainly
based on the three-dimensional scene and model objects, and the input and output of
information are different.

The input of visual information accounts for more than 80% of the total intake of
information and can provide interactive feedback immediately, which has a positive
effect on reducing the user’s cognitive load. Since the feedback comes from the
perception of vision [2], environment and input gestures, visual feedback plays an
important role in virtual and real interaction.


2 Related Work 


In view of the current research, the study of visual feedback in virtual and real
interaction can be divided into the following three directions: hardware development and
extensions, hardware algorithms for recognition accuracy, and natural gesture
design.

Mores Prachuabrued et al. explored visual feedback for fingers penetrating
virtual objects during virtual grasping and evaluated the performance (penetration, finger
release time, accuracy) of several common forms of visual feedback. The results showed
that the recommended visual feedback is color change [3]. Another experiment showed
that the combination of touch and visual feedback is optimal [4].

Faieza Abdul Aziz et al. studied the visual feedback mechanism and found that
users achieved higher efficiency in finishing assigned tasks with visual feedback. In
addition, the results showed that color changes are more effective than text prompts [5].

Based on previous studies, YingKai designed visual feedback for the virtual hand and
grasped objects. She concluded that (1) overall performance was better than local
performance, (2) visual change of the object was more effective than change of the
virtual hand, and (3) red obtained the best feedback [6].

Zhang Wei et al. reported that static color recognition efficiency in a virtual
environment is far higher than in the real world, and suggested using dynamic visual
feedback in the virtual scene [7].

Previous research has laid a foundation for related fields, but it also exposes
some shortcomings. For visual feedback in virtual and real interaction, it is not enough
to study only the changes of static visual feedback. Moreover, the level of the user’s
operation was not accurately controlled during the experiments, so it is necessary to use
auxiliary information from other channels to verify conclusions drawn under visual
feedback alone.

In order to extend the research in this field, on the basis of previous research
we conducted an experimental study on natural grasping gestures in virtual and real
interaction with respect to visual feedback. This paper mainly studies the differences in,
and validity of, the visual feedback forms of the target object. The feedback form
variables include object color, transparency, flashing, highlight, rotation, shaking,
scale, and deformation. In addition, auxiliary electromyography and heart rate tests were
applied for evaluation, and the experimental results were analyzed.


3 Experiment Design 


First, we define natural gestures. Natural gestures are gestures made with the bare
hand, without added markers or auxiliary wearable equipment, that interact with objects
directly. According to Wu and his co-author [8], a gesture includes three stages, as
shown in Fig. 1:


[Figure 1: state diagram of gesture input with three stages (gesture registration, gesture persistence, gesture termination) and transitions labeled Touch, Move, and Lift]

Fig. 1. Three stages of the gesture input, OOR means the device “out of range” 


We define the natural grasping gesture as two sub-actions: “encounter object” and
“pick up the object”. Visual feedback is introduced between these two sub-actions.

How can the validity of different visual feedback types in virtual reality interaction
be judged? The key lies in the course of the grasping gesture from “encounter object” to
“pick up the object”. The user encounters the object, sees the visual feedback, and then
quickly picks up the object. The moment at which the user perceives the feedback differs
between kinds of visual feedback; this time difference reflects the difference between
feedback forms and lets us analyze their effectiveness. On this basis, we built the
experimental platform and carried out the experimental study.


3.1 Design Experiment 


The experiment used Unity 5.4 as the platform and C# as the programming language. A
Leap Motion camera captured the gesture coordinates; participants used grasping gestures
to perform the experimental operations, and the data were recorded.

At the same time, Ergo LAB physiological detection equipment was used to record EMG
and heart rate during the experiment. The physiological indexes of the subjects were
recorded under the different visual feedback conditions (Fig. 2). The surface EMG (sEMG)
electrode was attached over the extensor carpi radialis brevis muscle of the right
arm [9], which is the main muscle active during the grasping gesture.

Fig. 2. Visual feedback cognitive experiment based on the Leap Motion system and Ergo LAB


In this study, we used a controlled-variable, single-factor design with an inner-group
comparison experiment and an inter-group comparison experiment. The inner-group
comparison used Leap Motion to grasp a virtual ball without feedback and under each of
the eight feedback variables, for a total of nine groups. The inter-group comparison
consisted of an empty grasp and grasping a real table-tennis ball, for a total of two
groups.


3.2 Control Variables


Object color and transparency change - The level of the object color change variable
was red, and the level of the object transparency change variable was alpha equal to
0.4, both taken from the literature [6].

Object flashing - The object flashing variable alternated the object between red and
transparent [7].

Object highlighting - The paper [10] suggested that highlighting is a desirable way to
interact with objects in a three-dimensional virtual scene. We therefore chose red as
the highlight color, displayed on the edge of the small ball.

Rotation, shaking, and scaling of objects - The author of [11] used scaling,
translation, and rotation operations for interacting with 3D models in virtual reality,
so we chose these three forms in the experiment.

Object deformation - When the virtual fingers grasp the small ball, the places where
they collide with it deform [12]. The radius and intensity of the deformation are kept
within an appropriate range to produce the deformation feedback form. The interface is
shown in Fig. 3.







[Figure 3 panels (recoverable labels): Flashing-feedback, Highlight-feedback, Shaking-feedback, Scale-feedback, Deformation-feedback]


Fig. 3. Visual feedback experimental interface overall (Color figure online) 


3.3 Experiment Process 


After pre-tests and screening, 20 participants were selected: 10 men and 10 women,
aged 20 to 25 (mean age 23.4). All were right-handed and familiar with computer
operation. All participants were in good health and showed no signs of physical fatigue,
such as from strenuous exercise, before the test. All participants took part
voluntarily.

Before the experiment started, the Ergo LAB device was fixed in place. The heart rate
sensor (HRV) was clamped on the subject’s left index finger. The skin over the extensor
muscles of the right arm was wiped with alcohol, and the electromyography (sEMG)
electrode was attached there.

The experiment began once the heart rate and myoelectric signals had stabilized. It
was divided into 2 groups with 11 procedures in total. In each procedure the participant
performed the grasp 5 times, resting for 3 s after each grasp. After each procedure, the
participant took a short break while the Ergo LAB physiological data were saved. The 11
procedures were carried out in turn, and the data were saved automatically to a local
file. After the experiment, the subjects rated their satisfaction with each visual
feedback form on a 5-point Likert scale with scores of -2, -1, 0, 1, and 2.


3.4 Experiment Data 


The experimental data included reaction time ΔT, satisfaction S, mean electromyography
amplitude YAverage, mean muscle contraction YVariance, and mean heart rate AVHR.
Reaction time ΔT = T1 - T2, where T1 is the time from the moment the subject encounters
the ball to the release of the ball, and T2 is the time from the moment the user begins
the grasping gesture to the release. ΔT is therefore the interval between the user
seeing the feedback and beginning the grasp, that is, the reaction time, and it is an
important indicator of the effectiveness of a feedback form. Satisfaction S is the score
the user gave to each of the eight feedback variables after completing the experiment,
one of -2, -1, 0, 1, 2. For later analysis the scores were normalized to the range 0
to 1.
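The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names are hypothetical, and the linear mapping of the Likert score onto [0, 1] is an assumed normalization, since the text does not say which one was used.

```python
from dataclasses import dataclass


@dataclass
class GraspTrial:
    """Timestamps in seconds for one grasp; all field names are hypothetical."""
    t_encounter: float    # hand first encounters the ball (feedback appears)
    t_grasp_start: float  # user begins the grasping gesture
    t_release: float      # ball is released


def reaction_time(trial: GraspTrial) -> float:
    """Return delta-T = T1 - T2 as defined in the text."""
    t1 = trial.t_release - trial.t_encounter    # encounter -> release
    t2 = trial.t_release - trial.t_grasp_start  # grasp start -> release
    return t1 - t2


def normalize_satisfaction(score: int) -> float:
    """Map a Likert score in {-2, ..., 2} linearly onto [0, 1] (assumed scheme)."""
    if score not in (-2, -1, 0, 1, 2):
        raise ValueError("score must be one of -2, -1, 0, 1, 2")
    return (score + 2) / 4
```

Because the release time cancels out, ΔT reduces to the grasp start time minus the encounter time, which matches the reading of ΔT as the time between seeing the feedback and starting the grasp.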

The physiological signal data were sorted in the Ergo LAB analysis software, and the
fragments running from one change in the signal baseline to the next were selected and
processed for analysis. Following the literature [13] on sEMG analysis, three indicators
commonly used in time-domain and frequency-domain analysis were selected. Of these,
YAverage represents the average amplitude level of the physiological signal in the
fragment, and YVariance represents the amplitude variance of the signal in the fragment.
The signal follows a zero-mean Gaussian distribution, and both indicators are
proportional to the degree of muscle contraction. AVHR represents the mean heart rate of
the fragment.
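As a rough illustration of the three indicators, the sketch below computes them for one fragment using only the Python standard library. It assumes that YAverage is the mean of the rectified (absolute) sEMG amplitude, which is a common choice for a zero-mean signal but is not confirmed by the text; the function and parameter names are hypothetical.

```python
from statistics import fmean, pvariance


def segment_features(emg_uv, heart_rate_bpm):
    """Return (YAverage, YVariance, AVHR) for one signal fragment.

    emg_uv:         sEMG samples of the fragment, in microvolts
    heart_rate_bpm: heart-rate samples of the same fragment, in beats per minute
    """
    y_average = fmean(abs(v) for v in emg_uv)  # assumed: mean rectified amplitude
    y_variance = pvariance(emg_uv)             # amplitude variance of the fragment
    avhr = fmean(heart_rate_bpm)               # mean heart rate of the fragment
    return y_average, y_variance, avhr
```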


4 Experiment Analysis


4.1 Inner-Group Analysis 


After the experiments with the 20 participants, a large amount of data was obtained.
Invalid data were eliminated, and the remaining data were consolidated and imported into
SPSS for analysis. The ΔT and S data were first tested for normality, and all passed.
Descriptive statistics are given in Table 1 below.


Table 1. Descriptive statistics of variable response time and satisfaction 


Feedback form            Mean of ΔT    Mean of S    Standard deviation
Color-feedback                         1.30         1102680
Transparency-feedback                  0.250        1800684
Flashing-feedback                      -0.20        1707118
Highlight-feedback                     1.40         1706103
Rotation-feedback                      -0.65        1851743
Shake-feedback                         -0.95        1803534
Scale-feedback                         -0.55        2113159
Deformation-feedback                   0.35         1518762
None-feedback            .960270                    2135176
Effective case number    20
It can be seen from the table that the reaction time ΔT without feedback was much
larger than under any of the eight feedback modes. This indicates that visual feedback
improved operational efficiency and that providing visual feedback is reasonable and
necessary.

The ΔT of the highlight-feedback was the smallest, and its user satisfaction was the
highest. The ΔT values of rotation-feedback, shake-feedback, scale-feedback, and
deformation-feedback were large and user satisfaction was relatively low, which gives a
first indication that these forms of feedback are less suitable.

To test whether the reaction time ΔT and satisfaction S results were consistent with
each other, a paired-samples t test was applied to ΔT and S. Except for the transparency
and deformation feedback forms, the other six feedback forms showed a significant
paired-sample correlation (sig < 0.05). This agrees with the descriptive statistics: a
short feedback reaction time goes with high satisfaction, and a long reaction time goes
with low satisfaction.
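The study ran this test in SPSS. For readers who want to see what the two quantities in a paired-samples analysis are, the following standard-library sketch computes the Pearson correlation of the paired samples and the t statistic of their differences. The function names are hypothetical, and the sketch does not compute the significance value.

```python
import math
from statistics import fmean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t_statistic(x, y):
    """t statistic of the paired differences x_i - y_i (n - 1 degrees of freedom)."""
    d = [a - b for a, b in zip(x, y)]
    return fmean(d) / (stdev(d) / math.sqrt(len(d)))
```

Here x would hold the per-participant reaction times and y the per-participant satisfaction scores for one feedback form.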

During data analysis, we found that the results for different variables showed a
certain consistency: color, transparency, and highlight feedback performed better, while
rotation, scaling, shaking, and deformation feedback performed relatively poorly.
Because the results within each set were close, we suspected that the variables might be
related. Cluster analysis was therefore performed, and the clustering results are shown
in Fig. 4.

Fig. 4. Dendrogram of the systematic (hierarchical) clustering result

As can be seen, shake-feedback, scale-feedback, deformation-feedback, and
rotation-feedback clustered together, and flashing-feedback, transparency-feedback,
highlight-feedback, and color-feedback clustered together. The eight variables can thus
be classified into two types, which we call visual display and movement performance.


4.2 Inter-group Analysis 


Because visual input alone could be deceptive, it was necessary to continue the
analysis and validate the results using physiological indicators; see Figs. 5 and 6.







Fig. 5. Diagram of sEMG and HRV signals

Fig. 6. sEMG in different experimental grasping processes: empty-grasping, grasping table-tennis, none-feedback, and highlight-feedback


After erroneous data were removed, the physiological index data were sorted and the
mean value of each variable was computed for each case. The satisfaction scores were
normalized at the same time, and the overall results are presented in Table 2.

Overall, the physiological index data differed between feedback forms, which suggests
that some psychological model links visual and tactile perception.

Table 2. Experimental data cross table


Condition                ΔT        S    YAverage    YVariance    AVHR
Empty-grasping                          3.7294
Grasping table-tennis                   83.7106     13006        70.4
Color-feedback                                                   73.1875
Transparency-feedback                                            75.0667
Flashing-feedback                                                76.7143
Highlight-feedback                                               75.3571
Rotation-feedback                                                71.1875
Shake-feedback                                                   75.1429
Scale-feedback                                                   71.4667
Deformation-feedback                                             71.7143
None-feedback            0.8869         3.665       1.8639       66.2143


The average EMG under visual feedback was positively related to the reaction time: the
longer the reaction time, the higher the average EMG value and the lower the
effectiveness of the feedback form;

The heart rate was associated with visual feedback: the more intense the visual
feedback, the higher the heart rate, although the effectiveness was not necessarily
high. Intense feedback should be chosen according to the context of use, such as
presentation, warning effects, and interaction effects;

Muscle fatigue was related to cognition: the more familiar and acceptable the form of
feedback, the lower the degree of muscle fatigue. Movement-type feedback was more
intuitive and more easily perceived, but its effect was not good.


5 Conclusion 


Through inner-group and inter-group analysis, the data were analyzed horizontally and
vertically, respectively. The findings are summarized as follows:


(1) In general, feedback forms based on visual expression were superior to those based
on movement;

(2) The stronger the stimulus, the more pronounced the physiological index. Color
feedback and highlight feedback performed best and also had the highest subjective
satisfaction;

(3) Flashing feedback was the most intense stimulus, but it was not suitable for
interactive feedback, and its subjective satisfaction was low;

(4) Deformation feedback was the form most consistent with how users understand
grasping gestures, and it is recommended for collision detection in virtual scenes;

(5) Scale feedback was not the best cognitively, but it could reduce the user’s
operation fatigue and shorten the user’s learning process in interactive operation.


In summary, we suggest the following for the form of visual feedback:

In virtual-real interaction, it is best to use visual-expression forms of feedback,
mainly color; both local and overall presentation should be considered;

In scene content design, dynamic visual feedback is more easily perceived than static
feedback;

In a multimodal (multi-channel) input and output mode, movement forms of visual
feedback are better and more consistent with human cognition;

When many forms of visual feedback are present in virtual-real interaction, priority
should be given to a consistent feedback method, choosing appropriately between
visual-expression and movement forms.


6 Summary and Future Work 


In this paper, we studied the effects of different visual feedback forms on cognition
and subjective user satisfaction under natural gestures. We also used physiological
detection equipment to record physiological characteristics during the experiment. We
conclude that visual feedback affects user reaction and user experience to some extent,
and that there is a relationship between physiological characteristics and visual
feedback.

The study of visual feedback types in this paper is still at an initial stage. The
variable levels for scaling and deformation feedback were derived from an actual
project, and the selected deformation parameters need further study. In addition, visual
feedback has more dimensions and variable types that need further study.
Abstract. To address the hidden security risks to business data in the logistics
traceability system for aquatic products, an information security strategy is designed
from two angles: data collection and data transmission. First, the aquatic traceability
code is designed. Then the data encryption algorithm is studied, and an information
security protection system for secure data transmission is built. For intelligent
collection of traceability data, QR Code is used to carry the logistics trace data, and
an improved data encryption algorithm is used to encrypt the data stored in the QR Code.
For transmission of traceability data, the HTTPS protocol is used to build an encrypted
client-server channel to ensure the integrity of the data. Finally, these information
security technologies are applied in a logistics traceability system for aquatic
products, and the effectiveness of the RC4-RSA hybrid encryption algorithm is verified.
The integrated application enhances the system’s information security.


Keywords: Logistics traceability system of aquatic product · Data security ·
Data encryption technology · RC4-RSA hybrid algorithm


1 Introduction 


In logistics systems, information security is becoming increasingly important [1].
Information security technologies have emerged such as two-dimensional information
encryption based on random shift [2], encoding encryption keys with GPS location
information [3], secure data aggregation based on a homomorphic encryption scheme [4],
encrypting the AES key with ECC [5], and continuous data protection mechanisms based on
virtual storage technology. In the logistics traceability system for aquatic products,
the data on aquaculture, processing, transportation, and sales are stored in bar codes
in encoded form. As network technology develops, the encryption algorithm must be
upgraded and data storage and backup must be strengthened to ensure information security
throughout the logistics traceability process.


2 Risk Analysis on Information Security of Aquatic Logistics
Traceability System


The logistics traceability system for aquatic products is a distributed processing
environment: a multi-module, multi-role, multi-service information system. As logistics
business grows, the security risks increase. Analyzing the hidden security risks of the
system across four layers (acquisition control, data, application, and user) shows that
the security problems of logistics business data lie mainly in two areas: data
acquisition and data transmission.

To improve the data security of the aquatic products traceability system and ensure
the integrity and availability of information, the authors propose a technical security
solution that addresses the hidden data security risks in the collection and
transmission processes.

In the logistics traceability system for aquatic products, data collection is mainly
based on bar code identification technology, wireless data transmission, and wireless
tag (RFID) technology [6]. Bar code technology is more widely used in logistics systems
than RFID because of its low cost. The two-dimensional QR Code is widely used in current
aquatic logistics traceability systems. Therefore, in the data collection part, this
paper focuses only on encryption of the QR Code as a representative research object.
Encrypting the QR Code avoids direct exposure of the information carried by the
two-dimensional code.


3 Research on Two-Dimensional Encoding and Encryption About 
Traceability Data of Aquatic Logistics 


3.1 Logistics Data Collection Code Generation Encryption 


3.1.1 QR Code Encoding Encryption Design 
(1) QR Code encryption location selection 


Analysis of the QR Code encoding process shows that data can be encrypted at several
stages before the final symbol is constructed, as shown in Fig. 1.

Among the candidate positions, encryption after data encoding is tied to the complex
code generation process and cannot guarantee that the original structure is unaffected
by the encrypted data. Its poor flexibility affects generation of the two-dimensional
code, and the error correction function cannot be guaranteed. Therefore, this paper
applies encryption at position 1 to prevent criminals from tampering with the
information.


(2) Encryption algorithm selection 


A QR Code encryption strategy should take into account both the security of the
information and the complexity of encoding and decoding. Table 1 compares several
common cryptographic algorithms along some important dimensions.

Fig. 1. QR Code multiple encryption

Table 1. Comparison of common encryption algorithms
Encryption algorithm | Category                      | Operation speed          | Security | Resource consumption
DES                  |                               |                          | Middle   |
AES                  |                               |                          | High     | Low
RC4                  | Symmetric (stream encryption) | 10 times faster than DES | High     | Low
RSA                  |                               |                          | High     |


Symmetric encryption algorithms suit the encryption of large amounts of data because
of their low decryption complexity and fast encryption speed, but they have shortcomings
in key management. Asymmetric algorithms offer high security and simple public key
management, but relatively slow encryption speed.

When QR Code is used in aquatic products logistics traceability, a large amount of
encrypted data has to be read, so both the speed and the security of encryption matter.
A combination of a symmetric and an asymmetric encryption algorithm is therefore used.
Based on the comparison of common algorithms in Table 1, this paper uses the RC4 and RSA
algorithms for hybrid encryption of the QR Code. The original plaintext data is
encrypted with the RC4 algorithm, which keeps encryption and decryption fast. The RC4
key is encrypted with the RSA algorithm, which protects the key, solves the problem of
key distribution and management, and further improves the security of the encryption.


(3) Encryption scheme for QR Code 


The RC4 and RSA algorithms are used in the encoding stage of the QR Code to encrypt
the data. The QR Code encryption process is shown in Fig. 2.

Fig. 2. QR Code encryption process


QR Code decoding is the reverse of encoding. The decryption process is shown in
Fig. 3.

Fig. 3. QR Code decryption process


Hybrid encryption with the RC4 and RSA algorithms improves speed while meeting the
security requirements of QR Code encryption and decryption. When QR Code is used in the
aquatic traceability system, the aquatic enterprise uses the RSA algorithm to generate a
public key and a private key. The public key can be made public, and the private key is
kept inside the enterprise for encryption. When a QR Code is generated, the enterprise
assigns an RC4 key according to the data to be encrypted, or generates one randomly.
This ensures that different QR Codes use different encryption keys, so that QR Code
encryption is generated dynamically and system security is improved. Logistics staff
read or write information using professional equipment that contains the QR Code
decryption algorithm and enter the information into the aquatic product traceability
system. Data acquisition and encryption thus provide traceability along the logistics
supply chain and enhance the security of information.
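The hybrid flow described above can be sketched as follows. This is a toy illustration, not the authors' implementation: it uses standard RC4 (without the chaotic improvement) and a textbook RSA with tiny, insecure parameters (p = 61, q = 53) applied byte by byte without padding. All function names and the modulus are assumptions chosen for the example. As in the text, the enterprise wraps the RC4 key with its private exponent and the reader unwraps it with the public exponent.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Standard RC4: encrypts or decrypts data with the given key."""
    s = list(range(256))
    j = 0
    for i in range(256):                      # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:                         # keystream generation + XOR
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


# Toy textbook RSA parameters (insecure, illustration only): n = 61 * 53.
N, E, D = 3233, 17, 2753   # modulus, public exponent, private exponent


def wrap_key(rc4_key: bytes) -> list:
    """Enterprise side: transform each key byte with the private exponent."""
    return [pow(b, D, N) for b in rc4_key]


def unwrap_key(wrapped: list) -> bytes:
    """Reader side: recover the RC4 key with the public exponent."""
    return bytes(pow(c, E, N) for c in wrapped)


def hybrid_encrypt(plaintext: bytes, rc4_key: bytes):
    """Return (ciphertext, wrapped RC4 key) to be stored in the QR Code."""
    return rc4(rc4_key, plaintext), wrap_key(rc4_key)


def hybrid_decrypt(ciphertext: bytes, wrapped: list) -> bytes:
    """Unwrap the RC4 key, then decrypt the payload."""
    return rc4(unwrap_key(wrapped), ciphertext)
```

Note that a key wrapped with the private exponent can be recovered by anyone who holds the public key, so in this arrangement the RSA step mainly shows that the key came from the enterprise; confidentiality then depends on how widely the public key and the decryption program are distributed.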


3.1.2 Chaos Improvement of RC4 Algorithm 

The RC4 step of the QR Code encryption program works by XOR (exclusive OR) with a
keystream; once the keystream (sub-password) repeats, the cipher is easy to crack. In
addition, the randomness and ergodicity of the algorithm are not good enough, and its
diffusion is low. The RC4 algorithm therefore has some weak keys and is vulnerable to
attack.

To improve the security of the RC4 algorithm in QR Code encryption, the Logistic
chaotic map is used to improve the randomness of the pseudo-random sub-key generation,
which gives better diffusion and reduces the occurrence of weak keys. Chaos is
disordered, random-looking behavior in a deterministic system; the sequence generated by
a chaotic system is unpredictable [7].

The widely used one-dimensional Logistic map is a chaotic map with the mathematical
expression:


$X_{n+1} = \mu X_n (1 - X_n)$    (2.1)


The ranges of the parameter $\mu$ and the initial value $X_0$ are:
$0 < \mu < 4$, $0 < X_0 < 1$.

When $\mu$ lies between approximately 3.57 and 4, the system reaches the chaotic state.
The values $X_n$ are then non-convergent and non-periodic and can traverse the whole
interval (0, 1); for parameters outside this range the sequence settles to fixed or
periodic values. The closer $\mu$ is to 4, the better the diffusion.

After a large number of experiments, the parameters used in this scheme are
$X_0 = 0.8755$ and $\mu = 3.99919$. The chaotic state reached with these values is shown
in Fig. 4.


Fig. 4. The chaotic mapping when $X_0 = 0.8755$, $\mu = 3.99919$
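A sequence like the one plotted in Fig. 4 can be generated with a short sketch of the Logistic map. The default parameters are the values given in the text; the function name and the sequence length are assumptions for illustration.

```python
def logistic_sequence(x0: float = 0.8755, mu: float = 3.99919, n: int = 1000):
    """Iterate X_{k+1} = mu * X_k * (1 - X_k) and return the n values after x0."""
    if not (0.0 < x0 < 1.0 and 0.0 < mu < 4.0):
        raise ValueError("requires 0 < x0 < 1 and 0 < mu < 4")
    values = []
    x = x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values
```

Plotting the returned values against their index gives the kind of irregular, non-repeating trace that the chaotic state refers to; with $\mu < 4$ every value stays inside (0, 1).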


With the chaotic map, the RC4 algorithm is improved as follows:


(1) Calculate the chaotic value from the chosen initial value of the Logistic map.


$y = \mu X_0 (1 - X_0)$    (2.2)


(2) Generate the random key sequence by substituting the chaotic value, iterated at
each step, into the RC4 key scheduling algorithm.

// executed for i = 0 .. 255; u is the parameter μ, x the current chaotic value
y = u * x * (1 - x);
x = y;
j = (j + S[i] + T[i] + floor(y * 256)) mod 256;
swap(S[i], S[j]);





(3) Apply the chaotic map recursively during the generation and swapping of the
pseudo-random sub-key.

// executed once per output byte r
i = (i + 1) mod 256;
y = u * x * (1 - x);
x = y;
j = (j + S[i] + floor(y * 256)) mod 256;
swap(S[i], S[j]);
t = (S[i] + S[j]) mod 256;
key[r] = S[t];





The sequence generated in the third step is XORed with the plaintext until encryption
is complete. Using the Logistic map greatly improves the randomness of the generated
key sequence.
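The three steps above can be put together in a runnable sketch. It follows the pseudocode closely but makes two assumptions that the text does not state: the chaotic value is truncated to an integer with int(y * 256), and the chaotic state carries over from the key scheduling stage into the keystream stage instead of being reset. The function names are hypothetical.

```python
def chaotic_rc4_keystream(key: bytes, n: int,
                          x0: float = 0.8755, mu: float = 3.99919) -> bytes:
    """Generate n keystream bytes with Logistic-map-perturbed RC4."""
    s = list(range(256))
    t = [key[i % len(key)] for i in range(256)]   # key expanded to 256 bytes
    x = x0
    j = 0
    for i in range(256):                          # step (2): key scheduling
        x = mu * x * (1.0 - x)                    # next chaotic value
        j = (j + s[i] + t[i] + int(x * 256)) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):                            # step (3): keystream bytes
        i = (i + 1) % 256
        x = mu * x * (1.0 - x)
        j = (j + s[i] + int(x * 256)) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def chaotic_rc4(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    stream = chaotic_rc4_keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))
```

Because the keystream depends on floating-point iteration of the map, the encrypting and decrypting sides must use the same $X_0$, the same $\mu$, and the same floating-point arithmetic to obtain identical keystreams.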

Encrypting and decrypting the QR Code with chaos-improved RC4 together with the RSA
algorithm improves the security of the QR Code information. It can effectively prevent
forgery of and tampering with logistics business data and protects the data at its
source, in data acquisition and generation.


4 Integrated Application of Aquatic Logistics Traceability Data 
Security Technology 


In this example, the aquatic logistics traceability system was built for an aquatic
product processing enterprise in Hubei Province according to the actual needs of the
enterprise. Two-dimensional bar codes are used to collect information intelligently and
to share aquatic product resources. If a quality problem occurs, a quick query of the
information contained in the two-dimensional code traces the relevant breeding and
distribution information and locates the problem.

For the security of logistics business data, QR Code data encryption is used to
protect data as information passes along the supply chain, and the HTTPS protocol is
used to secure the data transmission channel.


4.1 Application of QR Code Encryption Technology in Aquatic Logistics 
Traceability System 


4.1.1 Coding Design for QR Code in Aquatic Traceability 
The core of aquatic logistics traceability is the construction of a unique
traceability code. Following the traceability coding rules, QR Code is used as the label
carrier covering the aquatic raw materials, breeding, processing, and distribution
stages.

The aquatic enterprise, product category, and production and breeding information are
combined according to sub-rules to construct a unique aquatic traceability code. The
aquatic traceability code has 24 digits in total, as shown in Fig. 5.


[Figure 5 labels: vendor identification code, aquatic product identification code, unique traceability code, batch class number, serial number; digit positions N1 to N24]





Fig. 5. Structure of the 24-digit aquatic traceability code


A uniquely identifying aquatic traceability code must be constructed in order to truly
support retrospective queries of aquatic logistics, processing, distribution, and other
information.


4.1.2 Encryption Implementation of Aquatic Traceability QR Code 

The QR Code stores the processing and distribution information of aquatic products. If
the information is exposed directly in the two-dimensional code, it is easy to tamper
with, so the QR Code information needs to be encrypted when it is printed. The encrypted
QR Code is generated from the processing and distribution part of the trace code.

Using the processing and distribution tracking code “123401020300216090800087”, the
relevant information for the processing phase can be queried. With the pre-assigned RC4
algorithm key “txttest”, the encrypted QR Code is generated with a click. During QR Code
encryption, the aquaculture enterprise keeps its unique RSA private key and uses it to
encrypt the RC4 key into an intermediate key. Staff then attach the generated QR Code to
the corresponding aquatic product packaging so that users can query it easily.

When the aquatic products are transported to the processing site, staff obtain the
product details by scanning the QR Code with a PDA or mobile terminal that has the
decryption program built in. The RSA public key of the aquatic enterprise is built into
the handheld terminal program and is open, which makes it convenient for staff or
consumers to scan and decrypt. The decryption program uses the RSA public key to decrypt
the intermediate RC4 key formed during encryption and then uses the decrypted RC4 key to
recover the original processing information. Ordinary scanning software, without the
decryption program and the RSA public key, obtains only meaningless garbled text.

The effect before and after the QR Code encryption is shown in Fig. 6. 


[Figure 6: before encryption, scanning the QR Code shows readable content (processing and distribution tracking code, delivery time 2016-09-20, raw material grass carp, a link for more information); after encryption, scanning shows garbled characters]





Fig. 6. QR Code encryption effect diagram 


4.2 Result Analysis 


4.2.1 Stochastic Analysis of the Improved RC4 Algorithm 
To evaluate the improved RC4 algorithm used in QR Code generation and encryption, this
paper adopts the frequency test and the run test to verify the randomness of the
improved algorithm.

The measure of randomness is the P-Value; the statistic X follows a $\chi^2(n)$
distribution, and the significance level is $\alpha \in [0.001, 0.01]$. When
P-Value > $\alpha$, the sequence is considered random; here $\alpha = 0.01$.


(1) Frequency Test 


The frequency test checks the proportions of 1s and 0s in the entire random sequence
and whether the two are close.
The frequency test formulas are:


$\text{P-Value} = \mathrm{erfc}\left(\frac{x}{\sqrt{2}}\right), \quad x = \frac{|S_n|}{\sqrt{n}}$    (3.1)

$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-u^2}\,du$    (3.2)


erfc(x) is the complementary error function, and $S_n$ is the sum of the random
sequence after its bits are converted to -1 and 1. The RC4 algorithm is run in MATLAB to
obtain the random sequence, which is then converted and summed to obtain $S_n$.
Substituting $S_n$ into the formula gives the P-Value, as shown in Fig. 7. The figure
shows that P-Value > 0.01, which meets the requirement for an ideal random sequence.
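The authors computed this in MATLAB. An equivalent calculation of formula (3.1) can be sketched in Python with the standard library's math.erfc; the function name is hypothetical, and the input is assumed to be a list of 0/1 integers.

```python
import math


def frequency_test(bits) -> float:
    """Frequency (monobit) test: return the P-Value for a list of 0/1 bits."""
    n = len(bits)
    s_n = sum(1 if b else -1 for b in bits)   # map 0 -> -1, 1 -> +1 and add up
    x = abs(s_n) / math.sqrt(n)
    return math.erfc(x / math.sqrt(2))
```

A sequence passes at the significance level used in the text when the returned value exceeds 0.01.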


Fig. 7. Frequency test (horizontal axis: n, up to 1000)

Fig. 8. Run test (horizontal axis: n, up to 1000)


(2) Run Test 


The run test is a coherence test defined for sequences of two kinds of symbols. In the
RC4 randomness test, a run is an uninterrupted sub-sequence of identical bits. The
purpose of the run test is to determine whether the number of runs of 0s and 1s is
consistent with that expected for a random sequence (Fig. 8).

The run test formula is:

$$\text{P-Value} = \operatorname{erfc}(x), \quad x = \frac{|V(obs) - 2n\pi(1-\pi)|}{2\sqrt{2n}\,\pi(1-\pi)} \tag{3.3}$$

V(obs) is the total number of runs of 0s and 1s observed in the random sequence of the
RC4 algorithm, π here denotes the proportion of 1s in the sequence, and n is its length.
Substituting the relevant parameters in MATLAB gives the P-Value. As can be seen from
Figs. 7 and 8, P-Value > 0.01.

In both the frequency test and the run test, the P-Value obtained is greater than the
threshold of 0.01 specified by NIST, in line with the requirements of an ideal random
sequence.
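
The two tests can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' MATLAB code: the function names are hypothetical, the input is assumed to be a list of 0/1 integers, and the pre-check in the run test (skip the test when the proportion of 1s is too far from 1/2) follows the usual NIST formulation rather than anything stated in this paper.

```python
import math


def frequency_test(bits):
    """Frequency (monobit) test, Eq. (3.1): return the P-Value for a 0/1 list."""
    n = len(bits)
    s_n = sum(2 * b - 1 for b in bits)   # map 0 -> -1, 1 -> +1 and add up
    x = abs(s_n) / math.sqrt(n)
    return math.erfc(x / math.sqrt(2))


def run_test(bits):
    """Run test, Eq. (3.3): return the P-Value for a 0/1 list."""
    n = len(bits)
    pi = sum(bits) / n                   # proportion of 1s in the sequence
    # Assumed pre-check: the run test is not meaningful if the sequence is
    # constant or badly unbalanced, so report a P-Value of 0 in that case.
    if pi in (0.0, 1.0) or abs(pi - 0.5) >= 2 / math.sqrt(n):
        return 0.0
    # A new run starts wherever a bit differs from its predecessor.
    v_obs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    x = abs(v_obs - 2 * n * pi * (1 - pi)) / (2 * math.sqrt(2 * n) * pi * (1 - pi))
    return math.erfc(x)


def is_random(bits, alpha=0.01):
    """Accept the sequence when both P-Values exceed the significance level."""
    return frequency_test(bits) > alpha and run_test(bits) > alpha
```

In practice the input would be the keystream bits produced by the cipher under test; much longer sequences than the short lists suitable for a quick check are needed for the P-Values to be meaningful.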


5 Conclusion 


In the context of supply chain logistics systems, this paper analyzes the hidden risks to
business data in the aquatic logistics traceability system and constructs a data
acquisition and coding encryption scheme. For the chaotic-mapping-based RC4 algorithm
used in the QR Code encryption process, we carried out randomness tests, and the test
results show that its randomness meets the NIST requirements.

By using QR Code encryption with different RC4 keys, dynamic encryption of aquatic
supply chain traceability data can be achieved. This guarantees the uniqueness of aquatic
logistics data throughout the supply chain system, prevents forgery by others, and
improves the information security of the aquatic system at its root.

However, this study considers only the QR Code as a representative logistics information
carrier. Future research can extend the range of data acquisition information carriers
and, from the perspectives of grid intrusion detection and trusted computing, provide
more thorough protection of logistics system information security.

Abstract. By analyzing the factors affecting the airborne mission system, this
paper applies the Quantified Self method to the evaluation of human effectiveness
in the military airborne mission system. According to the depth of interaction
between people and information, we divide the information context
into four aspects: individual, equipment, network and environment.
We then construct a complete individual Quantified Self information interaction
system by collecting physiological, cognitive, behavioral and environmental
data. Finally, the functional architecture and composition of the ergonomic
evaluation platform are given in combination with the airborne mission
system.


Keywords: Quantified Self · Complex information system · Big data


1 Introduction 


With the development of information technologies such as the Internet of Things, cloud
computing and the mobile Internet, “Natural Interaction” and “Intelligent Decision” have
become important concepts in various information systems. The development of information
technology has also led to explosive growth of data, which has had a profound
effect on, and is gradually changing, the original mode of knowledge production and the
cognitive framework. Big data analysis has become an integral method in many fields, and
one of its trends is Quantified Self (Fig. 1).

The airborne cabin is an important operational environment in high-tech warfare. The
working space of the cabin is airtight and narrow, and the operating environment is
complex. Noise, vibration and electromagnetic fields affect the working condition and
operating ability of operators to varying degrees, which in turn affects combat
effectiveness. The main problems we have to face are therefore:


(1) Large amount of data: With the development of sensors, the amount of data will
become larger, the types of data more diverse and the update speed of
data faster.

Fig. 1. Conception of future combat cabin


(2) High complexity of tasks: There are many kinds of tasks, such as searching,
tracking, and monitoring. The information capacity of the task interfaces is enormous,
the structural relationships among the tasks are complicated, and the tasks
are uncertain and difficult to predict, which can easily cause
cognitive disorder and affect the judgement of the operator.

(3) Change of the working mode: The operators’ working mode is shifting mainly
from operation to surveillance and decision making, which results in a
sharp increase in the imbalance between human cognition and the interface
information encoding [2].


Information warfare requires both commanders and operators to perform tasks
accurately and in real time. A key link in improving individual performance
and team decision-making ability is to establish a real-time, high-precision state
awareness model for combatants, which can detect the cognitive state of commanders
and the state of soldiers in real time and adjust them in a timely manner. At the
same time, information warfare also demands greater dominance, autonomy and
intelligence. This calls for a decision support system that provides operators with
inferences about current and upcoming behavior and assigns tasks autonomously between
operators and the system.


2 Related Work 


Quantified Self, as one of the big data analysis methods, can discover valuable
information and implicit relationships in the human-computer-environment system by
analyzing various data associated with human activities, which can help to improve
current states.

The concept of “Quantified Self” emerged and developed only a few
years ago, mainly in the field of health. It is used to track and record individual
behaviors such as sports, sleep, diet and mood by using multiple devices such as
computers, portable sensors and smart phones. The collected data are then used to
discover valuable information about fitness, daily physiology and disease treatment,
which can be used to improve people’s living conditions. With the development of micro
sensors and intelligent mobile technology, Quantified Self will become an important
method for ergonomics research, information visualization and human-computer
interaction research.

At present, western countries led by the United States hold a clear lead
in the research field of complex systems. The key technology program of the United States
Department of Defense not only lists human-computer interaction interfaces as an
important part of software technology development, but also adds the human-computer
system interface as a separate item on a par with software technology. A new DARPA
project aims to enhance cognitive ability by changing the information
display mode, thereby improving operators’ efficiency. The US Navy research institute
used eye tracking technology to solve problems in ship pilot training [6]. The following
work was done:


(1) The researchers organized a large number of experienced crew members and ordinary
students to simulate operations such as navigation, berthing and emergency avoidance,
and collected eye movement data, behavioral data and some
electrophysiological data from them throughout the missions.

(2) The researchers identified different eye movement patterns between experienced
crew and ordinary students through detailed analysis of the above data, and established
a quantitative model of eye movement patterns.

(3) The model is applied to training students. By monitoring the eye movement data
of students, the system can automatically calculate whether the students deviate
from the correct mode of operation and give timely feedback according to the
model.


At present, the system has been applied to the simulation training of US Navy
ship pilots, and has achieved good results. The study of human efficacy has entered a new
stage in China, too. Quantitative research, as an important method for studying human
efficiency and user experience, has been carried out in many fields such as ships, tanks,
aircraft piloting and civil aviation control. Great importance is also attached to user
experience in the development of civil information systems such as the Internet, and a
large amount of quantitative research has been carried out. For example, Wen-jun Hou and
Xiao-yu Gao of the Beijing University of Posts and Telecommunications studied a
satisfaction measurement method based on the pupil [4]. Their paper analyzed the
relationship between pupil size and user satisfaction in usability tests. They
found a significant negative linear correlation between user satisfaction
and the standard deviation rate of the user’s pupil diameter, and built a model to
describe this relationship.


3 Classification of Quantified Self Data 


According to the depth of interaction between people and information, we divide the
information context into four aspects: individual, equipment, network
and environment [3]. We then construct a complete individual Quantified Self
information interaction system by collecting physiological data, cognitive data,
behavioral data and environmental data [3].

The physiological data include height, weight, blood oxygen, blood pressure, heart
rate, etc. These data reflect the health condition of the operator. The harsh cabin work
environment, with its limited space, noise, vibration and electromagnetic fields, has a
serious impact on the physical and mental health of operators. The narrow space makes
the operator stay in a fixed position for a long time, which leads to blood pooling in
the legs, causing swelling, stiffness and discomfort. Operators not only have to face
complex and diverse information, but also perform complex tasks, which can easily lead
to fatigue. Work intensity should be arranged reasonably according to
the corresponding physiological indexes so
as to achieve the best operation safety and performance.

The cognitive load caused by the consumption of mental energy during information
processing has become an important indicator in the reliability evaluation of
human-computer interaction systems. Many researchers have realized that cognitive
load is a very serious threat to job performance and operational safety. Faced with
large-screen displays and global visual interfaces covering beyond-visual-range
situations, the operator must, in order to grasp the situation of the whole system,
navigate to the interfaces of different functions through interface management tasks to
obtain all the information needed. When
cognitive load is excessive, the operator cannot successfully complete the task due to
the insufficient supply of cognitive resources, which leads to decreased accuracy,
prolonged reaction time and even accidents. However, effective control of cognitive
load does not mean blindly reducing it. When the cognitive
load is too low, especially in a monotonous and boring situation, the operator is likely
to become unresponsive and make more errors.

The environmental data, such as temperature and humidity, have a great impact
on the operators. For example, under insufficient illumination, operators
need to identify items repeatedly; their vision declines continuously, causing eye
fatigue and even whole-body fatigue. Human beings show different behavioral responses in
human-computer interaction in different environments. The environment affects not only
people’s working ability, but also the performance and reliability of the machine. On
the other hand, people and machines also affect the state of the environment. Therefore,
environmental factors are no longer excluded as passive interference factors outside the
system, but are brought into the system as active factors and become an important part
of the system research.

Behavior is the process of conscious decision making. Behavior is influenced by
many factors, such as personal habits, emotions, brain state, and so on. Under long-term
training, the operators’ proficiency gradually increases along with their efficiency and
speed. During this process, they form individual operating habits that combine with
their inherent personalities. It is necessary both to accommodate people’s common habits
in order to improve speed and to allow for differences in habits in order to reduce the
difficulty of operation. At the same time, highly efficient behavior patterns
should be promoted, and inefficient behavior patterns should be avoided by the operator.

4 The Ergonomic Evaluation Platform


We build a scientific assessment of environmental ergonomics based on big data
mining, machine learning and artificial intelligence technology to solve the problem of
small data in a small world. Through the ergonomic evaluation platform, we apply the
methods of Quantified Self to collect operators’ overall data, which is based on
objective data and supplemented by subjective data. Through machine learning, a
comprehensive multidimensional assessment model of the complex system is proposed, which
draws on continuously accumulated data and is mediated by workload (physiological
load and cognitive load) (Fig. 2).

Fig. 2. Ergonomic evaluation platform





The system combines eye movement tracking with motion capture and physiological
synchronization technology. It uses wearable data acquisition technology and wireless
data transmission technology so that operators are completely free from interference
while data are being collected. The human-environment synchronization platform is
capable of: recording, tracking and analyzing real-time psychological and physiological
changes of users, such as ECG, skin conductance and other indicators; analyzing people’s
attention, cognitive load and fatigue; and synchronizing real-time records of the human
behaviors involved, equipment running information and environmental information, such as
temperature and humidity.

The system collects and analyzes the causal relationships between changes in
human-computer-environment interaction, and realizes real-time simultaneous
recording and analysis of human-environment data. It thereby accomplishes
multidimensional data visualization analysis. At the same time, for effective data
extraction and visual presentation, we improve the human-computer interaction
interface and system process to build effective monitoring and evaluation methods.
For the different designs of the information display interface of the aircraft
system, indicators such as the physiology and eye movement of the operator are
analyzed. Meanwhile, the time, the number of occasions and the path of the operation
task are explored and analyzed. We identified the strengths and weaknesses of the
operation interface and tried to improve it so that it can serve as a standard for
intelligent human-computer interfaces.

5 Conclusion


Quantified Self combined with big data technology can be applied to the evaluation of
complex information systems, which greatly promotes the development of China’s
military strategy. It not only can improve the airborne information system interface and
operation efficiency but also provides more possibilities in real-time monitoring,
multi-distributed combat data acquisition and other aspects. Furthermore, sufficient and
comprehensive data contribute to display interface design, flexible layout and adaptive
reconfiguration. The system works toward establishing a real-time and high-precision
model of combat personnel state perception based on real-time analysis of physiological
data and neural computation. Moreover, it increases the accuracy of automatic online
analysis based on deep learning training.

Abstract. This paper applies speech recognition technology to the logistics
picking system. The picker starts picking on receiving a voice command from the
system, reports the “check code” by voice to complete the man-machine dialogue,
and the terminal equipment captures the “checksum” for identification. After
successful verification, the sorting operation is completed. When speech recognition
technology is used in a logistics picking system, walking time can be reduced, the
work flow is simplified, and data transmission accuracy and picking efficiency
are improved. This helps to raise the logistics service level and
enhance the economic efficiency of logistics enterprises and customer
satisfaction.


Keywords: Speech recognition technology · Logistics picking system ·
Check code · Picking efficiency


1 Introduction 


Order picking based on product barcodes is widely adopted in automatic logistics and
warehouse management systems. However, barcode order picking presents several
disadvantages when the number of orders increases significantly. Firstly, as the number
of orders increases, system capacity does not scale up easily, leading to delays in order
delivery. Secondly, current barcode order picking systems may still rely heavily on
human labor. As a result, the entire system becomes error-prone and inconsistent in
efficiency. In the present paper, we propose a voice recognition based order
picking system. It is our contention that the introduction of voice recognition and a
voice based control system can significantly reduce physical activity at the order
picking production line. The proposed approach, therefore, is applicable in labor
intensive and high throughput retail warehouse/distribution and logistics centers, in
particular, those that are not equipped with automatic and semiautomatic systems.


2 Speech Recognition Technology


Automatic Speech Recognition (ASR) refers to a whole raft of technologies that aim to
collect and process human speech and digitise such data into computer-understandable
and actionable data and/or instructions. Based on pre-trained models, computers should
be able to perform the entire process with no or very limited human intervention.

Upon receiving a human speech signal, ASR systems process the data based on
features such as ambient/background signals, timbre, pitch, etc. Figure 1 depicts the key
components of an ASR system. An ASR system normally consists of five modules:
preprocessing, feature extraction, model training, model scoring and selection, and
post-processing. The workflow can be largely divided into model training and model
application.


(1) Preprocessing module: Speech signal preprocessing includes filtering, A/D conversion,
denoising, speech enhancement, signal smoothing, end-point detection, etc.

(2) Feature extraction: After preprocessing, speech signals are subject to time- 
frequency analysis, Cepstral analysis and wavelet transformation. This is to extract 
from the signal data such features as: timbre, language, and voice contents. 

(3) Model training: Features extracted by the previous module are then fed into training 
algorithms to obtain language models. Model training normally is an iterative 
process wherein features are evaluated, re-processed and optimised so as to 
construct models that can reflect all features of the input data. Trained models are 
stored in a model library. 

(4) Model scoring and selection: When the ASR system is in application mode, features
extracted from the speech signal are utilised to identify and retrieve the best model
from the model library. A predefined scoring scheme can guide the selection of the best
matching models.

(5) Post-processing: Natural languages are ambiguous. In order to improve the system
performance, it may be necessary to introduce linguistic and semantic analysis to

Fig. 1. Chipset for speech recognition






3 Chipset for Speech Recognition 


The envisaged use case of the proposed system is that the speech recognition system
should be a portable device that the order picking staff can easily carry along without
hindering their normal work routines. Such a scenario gives rise to a plethora of
requirements on speech processing speed and accuracy, device portability and device
energy consumption profile. In the present paper, we focused on local dialects and opted
for a non-specific (speaker-independent) speech recognition chip. We also tuned the
model library based on the core user population, ambient and environmental noise, and
typical use cases of the distribution centre where
the system will be evaluated and put into production.

Currently, widely used non-specific speech recognition chips include ASR M08,
LD3320 and LingYang 61A. Such chips do not require pre-recording of speech signals
for initialisation and validation and are suitable for our envisaged use case [1].

The LD3320 speech recognition module is a specialised processor with peripheral circuits
including A/D and D/A converters, a microphone interface, an audio output interface, and
parallel and serial interfaces. It benefits from small physical size and low power
consumption, lending itself to mobile applications. LD3320 embeds a speech recognition
model trained on a large amount of speech data and thus provides an off-the-shelf
solution for many use cases. LD3320 provides a versatile set of external controls and
interfaces, including dynamic editing of the recognised keywords. LD3320, therefore,
facilitates further development of speech recognition functionalities and customisation
for specific application contexts [2].

End users of the proposed speech recognition system are order picking staff in labour
intensive manual warehouse/distribution centres. The speech recognition system is
expected to be worn by the users while they are moving between aisles, staging areas,
order picking lines and storage spaces. It should, therefore, be lightweight, low in
energy consumption, easy to carry and highly responsive [3]. Meanwhile, as the staff
turnover rate is very high in such a working environment, the speech recognition module
should be non-specific [4].

LD3320 presents the following key features: 


(1) Integrated Flash and RAM. This eliminates the need for external storage and thus
reduces the complexity and cost of the system and its overall energy consumption
profile.

(2) Embedded speech recognition models. The on-chip models are already suitable for
many generic application scenarios. The chip is therefore easy to learn and
adopt.

(3) Parameter tuning on distance and sensitivity. 

(4) No requirement on prior training and recording. 

(5) Dynamic keyword editing allows easy extension and adaption to new scenarios. 

(6) High accuracy rate (95%) against a list of as many as 50 keywords. 

(7) Integrated A/D and D/A converters, an integrated amplifier circuit and a 550 mW
speaker interface for playback; parallel and serial interfaces for further development.

(8) Can easily switch between sleep and active states to reduce energy consumption.

(9) Working power supply of 3.3 V. LD3320 can be powered by three AA batteries to
meet the power supply needs of portable systems.


LD3320 ICRoute is based on keyword recognition. Figure 2 illustrates key compo- 
nents of LD3320. 

LD3320 collects the speech signal through the integrated microphone (MIC). The signal
is then subject to spectrum analysis, feature extraction and feature engineering so as to
be ready for keyword extraction. A trained speech recognition model is then applied
to the processed signal, the outcome of which is a set of candidate words that are
compared against a predefined list provided to the system. During the final step, the
system computes a score for each candidate keyword and outputs the one(s) with the
highest score(s). Human intervention is possible through dynamic keyword editing via the
Micro Control Unit (MCU).

Fig. 2. The working principle of speech chip

LD3320 is equipped with two speech signal collection and recognition modes:


(1) Fixed Interval based: Based on a predefined interval (e.g. 3 s), LD3320 collects and 
records all speech signal data during the given time window. At the end of a time 
window, data collection is put on hold and keyword detection is only performed on 
signals collected within the time window. 

(2) Continuous analysis: LD3320 leverages voice activity detection (VAD)
technology to identify the beginning and end of human speech amid environmental
and ambient sound. All speech signals between these two points are collected and
then processed further.
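
The idea behind the continuous mode can be illustrated with a simple energy-threshold end-point detector. This is a generic sketch, not the LD3320's internal VAD, whose algorithm is not described here; the function name, the per-frame energy input, the threshold and the `hangover` parameter are all assumptions made for illustration.

```python
def detect_speech_segment(frame_energies, threshold, hangover=3):
    """Return (start, end) frame indices of the first speech segment, or None.

    frame_energies: per-frame signal energy (hypothetical input format).
    threshold: energy at or above which a frame counts as speech (assumed tunable).
    hangover: number of consecutive quiet frames that marks the end of speech.
    The returned end index is exclusive.
    """
    start = None
    quiet = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i            # beginning of speech
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:    # enough silence: end point reached
                return start, i - hangover + 1
    if start is None:
        return None                  # no speech found at all
    # Input ended while speech was still open: trim trailing quiet frames.
    return start, len(frame_energies) - quiet
```

Only the frames between the two end points would then be passed on to keyword recognition, which is what distinguishes this mode from the fixed-interval mode.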


Because multiple keywords may match, LD3320 treats any match found while processing the
input speech signal as an interim result and does not output the
identified keyword until the input speech signal stops (i.e. the end point of human
speech is identified or a predefined time interval ends). When a signal ending condition
is reached, LD3320 deems the input signal complete and returns the optimal keyword
match. For instance, the keyword list may contain “201” and
“2017”. While the speech signal is being processed, once “1” is detected, the best
matching keyword is “201”, with “2017” remaining a candidate with high probability.
LD3320 then proceeds as follows:


(1) if the input signal ends, it outputs “201”;

(2) if the input signal continues and “7” is detected, it recomputes “2017” as the best
match. In the end, if there is no speech signal after “7”, “2017” is deemed the best
matching keyword and returned to the end user.
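
The deferred-output behaviour in the “201”/“2017” example can be sketched as follows. This is a simplified illustration under stated assumptions: exact string matching on already-recognised symbols stands in for the chip's acoustic scoring, and the function name `recognise` and its return format are hypothetical.

```python
def recognise(symbols, keywords):
    """Return (final_keyword, trace) for a stream of recognised symbols.

    symbols: recognised symbols in arrival order, e.g. "2017" or ["2", "0", "1"].
    keywords: the predefined keyword list.
    trace: one (heard_so_far, interim_match, remaining_candidates) tuple per symbol.
    The interim match is never output; only the match that holds when the
    input ends is returned.
    """
    heard = ""
    interim = None
    trace = []
    for symbol in symbols:
        heard += symbol
        # Keywords that could still match if more speech follows.
        candidates = [k for k in keywords if k.startswith(heard)]
        # Interim result: a keyword that matches everything heard so far.
        interim = heard if heard in keywords else None
        trace.append((heard, interim, candidates))
    return interim, trace
```

With the keyword list ["201", "2017"], the input "201" returns "201", while the input "2017" first holds "201" as an interim result and then returns "2017" once the signal ends.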


4 Application of Speech Recognition in Order Picking 


Speech recognition enabled order picking proceeds as follows: 


(1) The order picking staff initiates the device by instructing “start operation”. On
receiving the initiation instruction, the portable device starts to “read” out the
region number, aisle number, and shelf number of the first order. It also instructs the
order picking staff to read out the barcode of the picked items.

(2) When the item is located, the order picking staff reads out the barcode. The portable
terminal compares the recognised barcode with the one in the picking order. If they
match, the portable terminal announces the quantity of the item for the order; otherwise,
it reads out the location information again.

(3) When the current order is finished, the order picking staff instructs the device with
“task complete”. The portable device changes the status of the current order and moves
to the next one in the order queue. If the order queue is empty, the portable terminal
announces “operation complete”.


Barcode comparison in step two acts as a key checkpoint. This is to avoid the potential
cost introduced by human errors. If the barcode does not match, the order picking staff
will be reminded of the correct location information or provided with means to manually
double-check the order details.
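
The three-step dialogue above can be sketched as a small simulation. This is an illustrative sketch only: the `Order` fields, the prompt wording and the rule that “task complete” is accepted only after a successful barcode check are assumptions, not details given in the paper.

```python
from dataclasses import dataclass


@dataclass
class Order:
    region: str
    aisle: str
    shelf: str
    barcode: str
    quantity: int


def run_picking_session(orders, utterances):
    """Simulate the dialogue and return the prompts spoken by the terminal.

    orders: the order queue, processed front to back.
    utterances: what the picker says, in order (already recognised as text).
    """
    prompts = []
    heard = iter(utterances)
    if next(heard, None) != "start operation":
        return prompts                      # device was never initiated
    for order in orders:
        location = f"region {order.region}, aisle {order.aisle}, shelf {order.shelf}"
        prompts.append(location)            # step 1: read out the location
        verified = False
        for utterance in heard:
            if utterance == "task complete" and verified:
                break                       # step 3: move to the next order
            if utterance == order.barcode:
                verified = True             # step 2: barcode matches
                prompts.append(f"quantity {order.quantity}")
            else:
                prompts.append(location)    # mismatch or "repeat": read location again
        else:
            return prompts                  # picker stopped before finishing the order
    prompts.append("operation complete")    # order queue is empty
    return prompts
```

For one order with barcode "6901234", the utterances "start operation", "6900000", "6901234", "task complete" yield the location twice (once initially, once after the mismatch), then the quantity, then "operation complete".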

The envisaged order picking scenario limits the keyword search space to reduce
system complexity. As shown above, interactions between the human order picking staff
and the portable terminal are restricted to the following keywords: “start operation”,
“repeat”, “task complete” and digit based item barcodes. Apart from barcodes, the other
instruction speech patterns can be preloaded, tuned and stored locally in the portable
terminal to enable offline non-specific speech signal recognition. Terminal and server
communication is only needed for handshaking, system initialisation, and downloading
order details. Downlink and uplink bandwidth can be kept to a minimum, which improves
system performance by reducing network latency.


5 Evaluation and Discussion 


A preliminary evaluation and comparative study has been carried out. Figure 3
demonstrates the difference between conventional and speech recognition based order
picking workflows.

Fig. 3. The difference between conventional and speech recognition based order picking
workflows





It is evident that our proposed solution can significantly reduce the physical activities
that are inevitable in the conventional approach based on barcode or RF scanning. Reduced
physical activity leads to improved work efficiency and fewer human errors due to
fatigue and negligence. Speech recognition also simplifies the overall workflow from the
original six steps to three by integrating or removing such activities as operating
scanners. The reduced number of steps also contributes to cost reduction through a
paperless work space, shorter per-order time, and a lower error rate.

The speech recognition based terminal optimises human system interface. The 
underlying system efficiency needs to be based on route optimisation and shelving and 
stock optimisation which are beyond the scope of this paper [5]. 


6 Conclusion 


The application of speech recognition in warehouse and logistics is beneficial in the 
following aspects: 


(1) Significantly reduce travel distance of order picking staff 
(2) Simplify the order picking workflow 

(3) Reduce errors due to human factors 

(4) Increase efficiency of order data broadcasting 


Jointly, the above lead to increased efficiency of order picking and thus reduced
overall cost.